Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/03 13:57:58 UTC

[GitHub] [hudi] wosow opened a new issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

wosow opened a new issue #2409:
URL: https://github.com/apache/hudi/issues/2409


   
    Spark Structured Streaming writes to Hudi and syncs to Hive, but only the read-optimized table is created; the real-time table is never created. No errors are reported.
   
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.7.5
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   Code as follows:

           import org.apache.hudi.DataSourceWriteOptions
           import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig, HoodieWriteConfig}
           import org.apache.hudi.hive.MultiPartKeysValueExtractor
           import org.apache.hudi.index.HoodieIndex

           // MERGE_ON_READ upsert with async compaction enabled
           batchDF.write.format("org.apache.hudi")
             .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "MERGE_ON_READ")
             .option(DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert")
             .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, "10")
             .option("hoodie.datasource.compaction.async.enable", "true")
             .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rec_id")
             .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "modified")
             // Hive sync: for a MOR table this is expected to register both <table>_ro and <table>_rt
             .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "ads")
             .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, hiveTableName)
             .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
             .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dt")
             .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
             .option(HoodieWriteConfig.TABLE_NAME, hiveTableName)
             .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
             .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
             .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://0.0.0.0:10000")
             .option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "")
             .option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "")
             .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, classOf[MultiPartKeysValueExtractor].getName)
             .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
             .option("hoodie.insert.shuffle.parallelism", "10")
             .option("hoodie.upsert.shuffle.parallelism", "10")
             .mode("append")
             .save("/data/mor/user")
   
   Only user_ro is created; there is no user_rt.
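
   For reference, this is roughly what a successful sync would let you query: a minimal sketch, assuming a SparkSession with Hive support enabled and the default _ro / _rt naming (the ads database and the user table name are taken from the snippet above).

   ```
   import org.apache.spark.sql.SparkSession

   // Hypothetical check, not part of the original report: after a successful Hive sync of a
   // MERGE_ON_READ table, both the read-optimized and the real-time view should be registered.
   // (Querying the _rt view additionally requires the Hudi bundle on the Hive/Spark classpath.)
   val spark = SparkSession.builder()
     .appName("hudi-hive-sync-check")
     .enableHiveSupport()
     .getOrCreate()

   spark.sql("SHOW TABLES IN ads LIKE 'user_*'").show()   // expect both user_ro and user_rt
   spark.sql("SELECT COUNT(*) FROM ads.user_ro").show()   // read-optimized view (base parquet files only)
   spark.sql("SELECT COUNT(*) FROM ads.user_rt").show()   // real-time view (parquet merged with log files)
   ```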
   
   


[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, a few quick questions as we triage the issue.
   > 
   > * Were you running an older version of Hudi and encountered this after an upgrade? In other words, were you able to run successfully with an older Hudi version, and with 0.7.0 there is a bug?
   > * Is this affecting your production? Trying to gauge the severity.
   > * Or are you trying out a POC, and is this the first time trying out Hudi?
   
        There is no impact on the production environment; the problem only occurred while testing 0.6.0, and I have not tested 0.7.0.
   
        In addition, I have another question. I use Sqoop to import data from MySQL to HDFS, and then use Spark to read it and write the Hudi table. The table type is MOR. If I want to use asynchronous compaction, what parameters need to be configured? Is asynchronous compaction automatic, or does it need manual intervention? After enabling asynchronous compaction, is it still necessary to trigger compaction manually? If compaction does have to be run manually on a regular basis, what parameters need to be configured and what are the commands for manual compaction? Looking forward to your answer!





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-774509647


   @wosow : also, a few quick questions as we triage the issue.
   - Were you running an older version of Hudi and encountered this after an upgrade? In other words, were you able to run successfully with an older Hudi version, and with 0.7.0 there is a bug?
   - Is this affecting your production? Trying to gauge the severity.
   - Or are you trying out a POC, and is this the first time trying out Hudi?
   


[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-809990967


   @wosow It looks like this is not a real issue in production. For your questions on async compaction, have you taken a look at this blog: https://hudi.apache.org/blog/async-compaction-deployment-model/ ? If your questions are still unanswered after reading it, please ping here and we will answer them.
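
   For anyone landing here later, here is a minimal sketch of the two compaction modes that blog discusses, assuming Hudi 0.6.x-era option keys (the async flag is the one already used in the snippet at the top of this thread; please verify the exact keys against your Hudi version).

   ```
   // Illustrative only, not the definitive recipe from the blog.

   // (1) Inline compaction for plain batch/datasource writers: compaction runs
   //     synchronously as part of the write once enough delta commits accumulate.
   val inlineCompactionOpts = Map(
     "hoodie.compact.inline" -> "true",
     "hoodie.compact.inline.max.delta.commits" -> "10"
   )

   // (2) Async compaction for Spark Structured Streaming writers: compaction is
   //     scheduled and executed in the background by the streaming writer, with no
   //     manual step required.
   val asyncCompactionOpts = Map(
     "hoodie.datasource.compaction.async.enable" -> "true"
   )

   // Either map can be merged into the writer options, e.g.
   //   batchDF.write.format("org.apache.hudi").options(asyncCompactionOpts). ... .save(basePath)
   ```

   Compaction can also be scheduled and run separately from the writer (for example via the Hudi CLI), which covers the "manual compaction" part of the question above.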


[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755894869


   > It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. You can look for logs around the log message
   > 
   > "Trying to sync hoodie table "
   
   Error as follows:
   There is no SQL for creating the _rt table, only the _ro table:
   ```
   ----------------------------------------------------------------------------------------------------------------------------------------------
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 371
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 337
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 404
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev03:18850 in memory (size: 73.0 KB, free: 2.5 GB)
   21/01/07 13:23:05 INFO BlockManagerInfo: Removed broadcast_16_piece0 on bigdatadev02:6815 in memory (size: 73.0 KB, free: 3.5 GB)
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 333
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 418
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 385
   21/01/07 13:23:05 INFO ContextCleaner: Cleaned accumulator 410
   21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:08 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:10 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
   21/01/07 13:23:10 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is MYSQL
   21/01/07 13:23:10 INFO ObjectStore: Initialized ObjectStore
   21/01/07 13:23:11 INFO HiveMetaStore: Added admin role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: Added public role in metastore
   21/01/07 13:23:11 INFO HiveMetaStore: No user is added in admin role, since config is empty
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_all_databases
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_all_databases	
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=default pat=*
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_functions: db=default pat=*	
   21/01/07 13:23:11 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_functions: db=dw pat=*
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_functions: db=dw pat=*	
   21/01/07 13:23:11 INFO HiveSyncTool: Trying to sync hoodie table api_trade_ro with base path /data/stream/mor/api_trade of type MERGE_ON_READ
   21/01/07 13:23:11 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:23:11 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_table : db=ads tbl=api_trade_ro	
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last compaction commit as Option{val=null}
   21/01/07 13:23:11 INFO HoodieHiveClient: Found the last delta commit Option{val=[20210107132154__deltacommit__COMPLETED]}
   21/01/07 13:23:12 INFO HoodieHiveClient: Reading schema from /data/stream/mor/api_trade/dt=2021-01/350a9a01-538c-4a7e-8c17-09d2cdc85073-0_0-20-85_20210107132154.parquet
   21/01/07 13:23:12 INFO HiveSyncTool: Hive table api_trade_ro is not found. Creating it
   21/01/07 13:23:12 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:23:12 INFO HoodieHiveClient: Executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS `ads`.`api_trade_ro`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `topic` string, `kafka_partition` string, `kafka_timestamp` string, `kafka_offset` string, `current_time` string, `kafka_key` string, `kafka_value` string,  `modified` string, `created` string, `batch_time` string) PARTITIONED BY (`dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/data/stream/mor/api_trade'
   21/01/07 13:24:09 INFO HiveSyncTool: Schema sync complete. Syncing partitions for api_trade_ro
   21/01/07 13:24:09 INFO HiveSyncTool: Last commit time synced was found to be null
   21/01/07 13:24:09 INFO HoodieHiveClient: Last commit time synced is not known, listing all partitions in /data/stream/mor/api_trade,FS :DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-398057260_1, ugi=root (auth:SIMPLE)]]
   21/01/07 13:24:09 INFO HiveSyncTool: Storage partitions scan complete. Found 1
   21/01/07 13:24:09 INFO HiveMetaStore: 0: get_partitions : db=ads tbl=api_trade_ro
   21/01/07 13:24:09 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_partitions : db=ads tbl=api_trade_ro	
   21/01/07 13:24:10 INFO HiveSyncTool: New Partitions [dt=2021-01]
   21/01/07 13:24:10 INFO HoodieHiveClient: Adding partitions 1 to table api_trade_ro
   21/01/07 13:24:10 INFO HoodieHiveClient: Executing SQL ALTER TABLE `ads`.`api_trade_ro` ADD IF NOT EXISTS   PARTITION (`dt`='2021-01') LOCATION '/data/stream/mor/api_trade/dt=2021-01' 
   21/01/07 13:24:33 INFO HiveSyncTool: Changed Partitions []
   21/01/07 13:24:33 INFO HoodieHiveClient: No partitions to change for api_trade_ro
   21/01/07 13:24:33 INFO HiveMetaStore: 0: get_table : db=ads tbl=api_trade_ro
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=get_table : db=ads tbl=api_trade_ro	
   21/01/07 13:24:33 ERROR HiveSyncTool: Got runtime exception when hive syncing
   org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210107132154
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:658)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:128)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:91)
   	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
   	at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:196)
   	at com.chin.dmp.stream.mor.ApiTradeStream$$anonfun$1.apply(ApiTradeStream.scala:163)
   	at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
   	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
   	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
   	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
   	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
   Caused by: NoSuchObjectException(message:ads.api_trade_ro table not found)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
   	at com.sun.proxy.$Proxy41.get_table(Unknown Source)
   	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
   	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
   	at com.sun.proxy.$Proxy42.getTable(Unknown Source)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
   	... 48 more
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Shutting down the object store...
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Shutting down the object store...	
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Metastore shutdown complete.
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Metastore shutdown complete.	
   21/01/07 13:24:33 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> ads, hoodie.insert.shuffle.parallelism -> 10, path -> /data/stream/mor/api_trade, hoodie.datasource.write.precombine.field -> modified, hoodie.datasource.hive_sync.partition_fields -> dt, hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> api_trade, hoodie.index.type -> GLOBAL_BLOOM, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> api_trade, hoodie.datasource.hive_sy
 nc.jdbcurl -> jdbc:hive2://0.0.0.0:10000, hoodie.datasource.write.table.type -> MERGE_ON_READ, hoodie.datasource.write.hive_style_partitioning -> true, hoodie.datasource.query.type -> snapshot, hoodie.bloom.index.update.partition.path -
   ----------------------------------------------------------------------------------------------------------------------------------------------
   
   ```


[GitHub] [hudi] nsivabalan edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-893436528


   @dude0001 : Did you open up a new GitHub issue? Usually the table rename exception happens if the table name in Hudi does not match the one in Hive. Is there any case sensitivity that could be an issue? If your table name has capital letters, can you try all lower-case letters?
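
   A minimal sketch of that suggestion, reusing the imports and variables from the snippet at the top of the thread (illustrative only; the remaining write options are omitted for brevity):

   ```
   // Hive stores table identifiers in lower case, so keeping the Hudi table name and the
   // Hive sync table name identical and lower-case avoids the mismatch described above.
   val tableName = hiveTableName.toLowerCase
   batchDF.write.format("org.apache.hudi")
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
     // ...remaining options as in the original snippet...
     .mode("append")
     .save("/data/mor/user")
   ```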





[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-771397330


   @wosow Were you able to resolve your issue?


[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-891920635


   @dude0001 : Hey, if you don't mind, can you create a new GitHub issue? We do not want to pollute this one, since yours is not Spark streaming. We can add a link to this issue calling it out as related.





[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-813864327


   @wosow Did you get a chance to read the blog? Please let us know if this issue is still valid.


[GitHub] [hudi] dude0001 commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
dude0001 commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-891885608


   @nsivabalan that is one difference in my reproduction steps. I am currently not using a Spark Streaming job. I'm reading from our raw zone in S3, which contains parquet files with change data capture events from transactional databases. I'm trying to upsert into our cleansed zone, also in S3, so that it contains the latest version of each row. If I turn off Hive sync, it otherwise works fine.


[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-853602852


   Closing this ticket due to inactivity. @wosow Please feel free to re-open if you need more information. 





   Caused by: NoSuchObjectException(message:ads.api_trade_ro table not found)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
   	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
   	at com.sun.proxy.$Proxy41.get_table(Unknown Source)
   	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
   	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
   	at com.sun.proxy.$Proxy42.getTable(Unknown Source)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
   	... 48 more
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Shutting down the object store...
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Shutting down the object store...	
   21/01/07 13:24:33 INFO HiveMetaStore: 0: Metastore shutdown complete.
   21/01/07 13:24:33 INFO audit: ugi=root	ip=unknown-ip-addr	cmd=Metastore shutdown complete.	
   21/01/07 13:24:33 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> ads, hoodie.insert.shuffle.parallelism -> 10, path -> /data/stream/mor/api_trade, hoodie.datasource.write.precombine.field -> modified, hoodie.datasource.hive_sync.partition_fields -> dt, hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> api_trade, hoodie.index.type -> GLOBAL_BLOOM, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> api_trade, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://0.0.0.0:10000, hoodie.datasource.write.table.type -> MERGE_ON_READ, hoodie.datasource.write.hive_style_partitioning -> true, hoodie.datasource.query.type -> snapshot, hoodie.bloom.index.update.partition.path -
   ----------------------------------------------------------------------------------------------------------------------------------------------
   





[GitHub] [hudi] nsivabalan edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-891920635


   @dude0001 : Hey, if you don't mind, can you create a new GitHub issue? I don't want to pollute this issue, since yours is not Spark streaming. We can add a link to this issue calling it out as related. Also, since this is related to the Glue catalog, I can CC some AWS folks and ask them to help us out.





[GitHub] [hudi] n3nash closed issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2409:
URL: https://github.com/apache/hudi/issues/2409


   





[GitHub] [hudi] dude0001 commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
dude0001 commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-893584529


   I apologize, I did not open a new issue as I've been in meetings and working on production stories. I got back to my POC this morning and proved your theory correct. Renaming my table to all lower case resolved my issue! To your point, I'm not sure my rename exception is the same as the original issue, which may or may not be related to streaming. Would you still like me to open an issue for the table rename exception? One change might be to add a warning in Hudi for this scenario, or to add documentation somewhere to make syncing Hudi metadata to Hive in the AWS Glue Data Catalog a little less painful. I'm happy to do it, or just move on. Please let me know, and thank you again for the help!





[GitHub] [hudi] nsivabalan closed issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2409:
URL: https://github.com/apache/hudi/issues/2409


   





[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-771397330


   @wosow Were you able to resolve your issue ?





[GitHub] [hudi] n3nash commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-853602852


   Closing this ticket due to inactivity. @wosow Please feel free to re-open if you need more information. 





[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after upgrade? in other words, older Hudi version you were able to run successfully and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   There is no impact on the production environment; the problem only occurred while testing 0.6.0, and I have not tested 0.7.0.

   In addition, I have another question. I use Sqoop to import data from MySQL into HDFS, and then use Spark to read and write the Hudi table. The table type is MOR. If I want to use asynchronous compaction, what parameters need to be configured? Is asynchronous compaction automatic, or does it need manual intervention once it is enabled? If compaction has to be run manually on a regular basis, what parameters need to be configured and what are the commands for running it manually? Looking forward to your answer!





[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798864560


   > @wosow Were you able to resolve your issue ?
   
   no





[GitHub] [hudi] bvaradar commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-757035695


   @wosow : The _rt table syncing happens after the _ro table, and I see a HiveMetaStore exception when updating the commit time on the _ro table saying that the table does not exist. This is weird because, in the few log messages above, I can see that the _ro table is registered. Somehow the _ro table is not visible to the HiveMetaStoreClient. I think it is likely that HiveServer and HiveMetastore are not set up correctly and that there is more than one HiveMetastore instance involved (one probably local).
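
   As an illustration of the diagnosis above (a minimal sketch, not a verified fix): the writing job can be pointed at the shared metastore explicitly so that the sync and the later commit-time lookup see the same catalog. The host name below is a placeholder; 9083 is the default metastore thrift port.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Sketch only: make the writing job talk to the shared Hive metastore rather than
   // a local/embedded one, so the _ro table registered by hive sync is still visible
   // when Hudi later tries to update its last-synced commit time.
   // "metastore-host" is a placeholder; 9083 is the default metastore thrift port.
   val spark = SparkSession.builder()
     .appName("hudi-writer")
     .config("hive.metastore.uris", "thrift://metastore-host:9083")
     .enableHiveSupport()
     .getOrCreate()
   ```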








[GitHub] [hudi] bvaradar commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755171465


   can you copy the contents of hoodie.properties of the dataset here ?





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-893436528


   @dude0001 : Did you open up a new GitHub issue? Usually the table rename exception happens if your table name in Hudi mismatches with the one in Hive. Is there any case sensitivity that could be an issue? If your table name has capital letters, can you try all lower case?
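
   For illustration, a hedged sketch of that suggestion: keep `hoodie.table.name` and the hive-sync table name identical and lower case so the metastore (or Glue catalog) never sees what looks like a rename. The table name, database and `df` below are assumptions, not values from this thread.

   ```scala
   // Illustrative sketch only; "charge_active", "ads" and df are assumed placeholders.
   val tableName = "charge_active" // all lower case on purpose

   df.write.format("org.apache.hudi")
     .option("hoodie.table.name", tableName)
     .option("hoodie.datasource.hive_sync.table", tableName)
     .option("hoodie.datasource.hive_sync.database", "ads")
     .option("hoodie.datasource.hive_sync.enable", "true")
     .mode("append")
     .save("/data/stream/mor/" + tableName)
   ```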





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-891871338


   @dude0001 : Just to confirm, are you also facing the issue only with Spark streaming?





[GitHub] [hudi] dude0001 commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
dude0001 commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-889924380


   I'm seeing the same error on 0.8.0. I am using this with AWS Glue (managed) ETL jobs and trying to sync the Hudi metadata to Glue Data Catalog. This happens the first time I run my job and it is trying to create the tables in the Glue Data Catalog. I suspect there is a permissions issue or there is schema evolution being detected that isn't supported with my setup.
   
   I'm getting an additional error: "java.lang.UnsupportedOperationException: Table rename is not supported"
   
   I'm just trying a PoC but we are pretty hot on using this as it solves a lot of our problems nicely.
   
   ```
   2021-07-30 10:06:08,704 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
     File "/tmp/raw-to-staging.py", line 146, in <module>
       main()
     File "/tmp/raw-to-staging.py", line 137, in main
       .save("s3://myBucket/Staging/mySourceDB/mySchema/myTable/")
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
       self._jwrite.save(path)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o416.save.
   : org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing ChargeActive
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
   	at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
   	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
   	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210730100502
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:168)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:112)
   	... 40 more
   Caused by: java.lang.UnsupportedOperationException: Table rename is not supported
   	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:515)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
   	... 42 more
   ```





[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after upgrade? in other words, older Hudi version you were able to run successfully and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
    There is no impact on the production environment; the problem only occurred while testing 0.6.0, and I have not tested 0.7.0.

    In addition, I have another question. I use Sqoop to import data from MySQL into HDFS, and then use Spark to read and write the Hudi table. The table type is MOR. If I want to use asynchronous compaction, what parameters need to be configured? Is asynchronous compaction automatic, or does it need manual intervention once it is enabled? If compaction has to be run manually on a regular basis, what parameters need to be configured and what are the commands for running it manually? Looking forward to your answer!
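
    Not an authoritative answer, but a minimal sketch of the compaction knobs being asked about, assuming the Spark datasource writer; `df` and the save path are placeholders, and exact keys and defaults differ between Hudi versions.

    ```scala
    // Illustrative only (df and the path are placeholders, not from this thread).
    // Option A - inline compaction: runs synchronously inside the write, once every
    // N delta commits have accumulated on the MOR table.
    df.write.format("org.apache.hudi")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.compact.inline", "true")
      .option("hoodie.compact.inline.max.delta.commits", "10")
      .mode("append")
      .save("/data/mor/example_table")

    // Option B - async compaction: mainly used with the streaming sink, where setting
    // "hoodie.datasource.compaction.async.enable" to "true" (and leaving
    // "hoodie.compact.inline" false) lets compaction run in the background instead of
    // blocking each micro-batch.
    ```

    Compaction can also be scheduled and executed by hand from hudi-cli (`compaction schedule`, then `compaction run`), which is the usual route when neither inline nor async compaction is enabled in the writer.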





[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
wosow commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755838339


   > can you copy the contents of hoodie.properties of the dataset here ?
   hoodie.properties as follows:
   [hoodie.zip](https://github.com/apache/hudi/files/5779109/hoodie.zip)
   
   





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-905173978


   Thanks, we will add an FAQ shortly. 





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-891871716


   @rmahindra123 : Do you mind taking a look at this? Let's sync up sometime today. 





[GitHub] [hudi] n3nash closed issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2409:
URL: https://github.com/apache/hudi/issues/2409


   





[GitHub] [hudi] dude0001 edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
dude0001 edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-889924380


   I'm seeing the same error, and the same symptom of only the _ro table being created. I am using this with AWS Glue (managed) ETL jobs and trying to sync the Hudi metadata to the Glue Data Catalog. This happens the first time I run my job, when it is trying to create the tables in the Glue Data Catalog. I suspect there is a permissions issue, or schema evolution is being detected that isn't supported with my setup.
   
   I'm getting an additional error: "java.lang.UnsupportedOperationException: Table rename is not supported"
   
   I'm just trying a PoC but we are pretty hot on using this as it solves a lot of our problems nicely.
   
   **Environment Description**
   
   * Hudi version: 0.8.0
   * Spark version : 2.4.3
   * Hive version : 2.4.3 (?)
   * Hadoop version : 2.8.5
   * Storage (HDFS/S3/GCS..) : S3 EMRFS
   * Running on Docker? (yes/no) : ? (I'm using AWS Glue (Managed) ETL, not positive)
   
   ```
   2021-07-30 10:06:08,704 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
     File "/tmp/raw-to-staging.py", line 146, in <module>
       main()
     File "/tmp/raw-to-staging.py", line 137, in main
       .save("s3://myBucket/Staging/mySourceDB/mySchema/myTable/")
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
       self._jwrite.save(path)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o416.save.
   : org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing ChargeActive
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
   	at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
   	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
   	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210730100502
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:168)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:112)
   	... 40 more
   Caused by: java.lang.UnsupportedOperationException: Table rename is not supported
   	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:515)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
   	... 42 more
   ```





[GitHub] [hudi] bvaradar commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-755870995


   It is indeed a MOR table. Can you check your driver logs? You might find some exceptions around registering the _rt table. You can look for logs around the log message 
   
   "Trying to sync hoodie table "
   
   





[GitHub] [hudi] nsivabalan commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-892287202


   Oh, by the way, COW does not have two tables; only MOR has two tables (_ro and _rt). 
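
   To make the distinction concrete, a small hedged sketch using the table names from the log earlier in this thread (adjust the database and table to your own); it assumes hive sync has already succeeded for the MOR table.

   ```scala
   import org.apache.spark.sql.SparkSession

   // Assumes the MOR table "api_trade" in database "ads" has been synced, which
   // registers two Hive tables: api_trade_ro (read optimized) and api_trade_rt (snapshot).
   val spark = SparkSession.builder()
     .appName("hudi-ro-rt-check")
     .enableHiveSupport()
     .getOrCreate()

   // Read-optimized view: only compacted base parquet files.
   spark.sql("SELECT COUNT(*) FROM ads.api_trade_ro").show()

   // Real-time / snapshot view: base files merged with not-yet-compacted log files.
   spark.sql("SELECT COUNT(*) FROM ads.api_trade_rt").show()
   ```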





[GitHub] [hudi] dude0001 edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

Posted by GitBox <gi...@apache.org>.
dude0001 edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-889924380


   I'm seeing the same error, and the same symptom of only the _ro table being created. I am using this with AWS Glue (managed) ETL jobs and trying to sync the Hudi metadata to the Glue Data Catalog. This happens the first time I run my job, when it is trying to create the tables in the Glue Data Catalog. I suspect there is a permissions issue, or schema evolution is being detected that isn't supported with my setup. I was initially using a MoR dataset when I hit this error; I have since tried CoW instead and hit the same error.
   
   I'm getting an additional error: "java.lang.UnsupportedOperationException: Table rename is not supported"
   
   I'm just trying a PoC but we are pretty hot on using this as it solves a lot of our problems nicely.
   
   **Environment Description**
   
   * Hudi version: 0.8.0
   * Spark version : 2.4.3
   * Hive version : 2.4.3 (?)
   * Hadoop version : 2.8.5
   * Storage (HDFS/S3/GCS..) : S3 EMRFS
   * Running on Docker? (yes/no) : ? (I'm using AWS Glue (Managed) ETL, not positive)
   
   ```
   2021-07-30 10:06:08,704 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
     File "/tmp/raw-to-staging.py", line 146, in <module>
       main()
     File "/tmp/raw-to-staging.py", line 137, in main
       .save("s3://myBucket/Staging/mySourceDB/mySchema/myTable/")
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
       self._jwrite.save(path)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
       return f(*a, **kw)
     File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
       format(target_id, ".", name), value)
   py4j.protocol.Py4JJavaError: An error occurred while calling o416.save.
   : org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing ChargeActive
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
   	at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
   	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
   	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
   	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
   	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
   	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20210730100502
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:496)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:168)
   	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:112)
   	... 40 more
   Caused by: java.lang.UnsupportedOperationException: Table rename is not supported
   	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:515)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:400)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:385)
   	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:494)
   	... 42 more
   ```

