Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/21 14:50:08 UTC

[GitHub] [hudi] GurRonenExplorium opened a new issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

GurRonenExplorium opened a new issue #1856:
URL: https://github.com/apache/hudi/issues/1856


   **Describe the problem you faced**
   Hey,
   tl;dr: Hive Sync fails on `alter table ... cascade` when syncing to the Glue catalog.
   
   I am running a PoC with Hudi on a time-series dataset we have. The input is partitioned by insertion_time, with late data arriving at most 48 hours behind; the output is the same dataset, partitioned by event_time, plus some additional fields (all computed row by row, with no aggregations).
   
   Setup: AWS EMR with transient clusters (Spark for the job itself, Hive for access to the Glue metastore used by the HiveSync tool; by the way, if there is a better way, I'm happy to hear it).
   
   Steps I took:
   1. Loaded one day of data (worked well).
   2. Loaded a few extra days in single-partition batches (each run covered one insertion_time partition); everything synced well.
   3. Ran a full month of data in a single job.
   4. The data loaded into Hudi successfully, but HiveSync failed with an alter table error.
   
   **Expected behavior**
   
   Hive Sync shouldn't crash when syncing to the Glue catalog.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.6
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   * EMR version : 5.30.1
   
   **Stacktrace**
   The stacktrace is slightly redacted; if anything more is needed, I can provide it.
   ```
   20/07/19 19:27:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
   org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL ALTER TABLE `#DB_NAME#`.`#TABLE_NAME#` REPLACE COLUMNS(`_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `utc_timestamp` string, `local_timestamp_with_timezone` string, `utc_timestamp_with_timezone` string, `#COL1#` string, `#COL2#` string, `#COL3#` double, `#COL4#` double, `#COL5#` string, `#COL6#` string, `#COL7#` double, `#COL8#` double, `#COL9#` string, `#COL10#` bigint, `#COL11#` string, `#COL12#` string, `#COL13#` string, `#COL14#` string, `#COL15#` string, `#COL16#` string, `#COL17#` string, `#COL18#` int, `hash_id` string, `#REDACTED#_6` string, `#REDACTED#_7` string, `#REDACTED#_8` string, `#REDACTED#_9` string, `#REDACTED#_10` string, `#REDACTED#_11` string, `offset_year` int, `offset_month` int, `offset_dayofmonth` int, `offset_dayofweek` int, `offset_hourofday` int ) cascade
       at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:482)
       at org.apache.hudi.hive.HoodieHiveClient.updateTableDefinition(HoodieHiveClient.java:261)
       at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:164)
       at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:114)
       at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
       at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
       at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
       at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
       at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
       at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
       at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
       at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
       at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
       at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
       at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
       at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
       at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
       at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
       at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
       at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
       at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
       at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
       at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
       at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
       at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
       at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
       at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
       at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
       at ai.explorium.reveal.RevealS3IngestorApp$.main(RevealS3IngestorApp.scala:89)
       at ai.explorium.reveal.RevealS3IngestorApp.main(RevealS3IngestorApp.scala)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: java.sql.SQLException: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cascade for alter_table is not supported
       at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
       at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
       at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
       at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
       at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:363)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.UnsupportedOperationException: Cascade for alter_table is not supported
       at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:509)
       at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table_with_environmentContext(AWSCatalogMetastoreClient.java:438)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336)
       at com.sun.proxy.$Proxy42.alter_table_with_environmentContext(Unknown Source)
       at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:628)
       at org.apache.hadoop.hive.ql.exec.DDLTask.alterTable(DDLTask.java:3590)
       at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:390)
       at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
       at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
       at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
       at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
       at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
       ... 11 more
   
       at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:297)
       at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:480)
       ... 47 more
   20/07/19 19:27:47 INFO SparkContext: Invoking stop() from shutdown hook
   20/07/19 19:27:47 INFO SparkUI: Stopped Spark web UI at http://ip-172-31-64-38.eu-west-1.compute.internal:4040
   20/07/19 19:27:47 INFO YarnClientSchedulerBackend: Interrupting monitor thread
   20/07/19 19:27:47 INFO YarnClientSchedulerBackend: Shutting down all executors
   20/07/19 19:27:47 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   20/07/19 19:27:47 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
   (serviceOption=None,
    services=List(),
    started=false)
   20/07/19 19:27:47 INFO YarnClientSchedulerBackend: Stopped
   20/07/19 19:27:47 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   20/07/19 19:27:47 INFO MemoryStore: MemoryStore cleared
   20/07/19 19:27:47 INFO BlockManager: BlockManager stopped
   20/07/19 19:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
   20/07/19 19:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   20/07/19 19:27:47 INFO SparkContext: Successfully stopped SparkContext
   20/07/19 19:27:47 INFO ShutdownHookManager: Shutdown hook called
   20/07/19 19:27:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-7ba98d71-ce9e-4f47-838d-02093ea288fc
   20/07/19 19:27:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-bc6a9489-a6d0-47c9-a30b-04c538bf519e
   Command exiting with ret '0'
   ```
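   For readers tracing the failure: the stack shows `HoodieHiveClient.updateHiveSQL` running the generated DDL through a plain Hive JDBC `Statement`, so Glue's `UnsupportedOperationException` for `CASCADE` surfaces as a `SQLException` wrapped in a `HoodieHiveSyncException`. Below is a minimal sketch of the equivalent JDBC call, with a hypothetical host, credentials, and a shortened column list:
   ```
   import java.sql.DriverManager

   // Hedged sketch, not Hudi's actual code: HiveSync issues DDL like the statement in the
   // trace above over Hive JDBC; the Glue metastore client then rejects the CASCADE clause.
   Class.forName("org.apache.hive.jdbc.HiveDriver")
   val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "hive", "")
   try {
     val stmt = conn.createStatement()
     // Redacted form of the failing statement:
     stmt.execute("ALTER TABLE `db`.`tbl` REPLACE COLUMNS(`_hoodie_commit_time` string) cascade")
   } finally {
     conn.close()
   }
   ```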
   Thanks for this project!



[GitHub] [hudi] bvaradar commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

bvaradar commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-663677430


   Please reopen if you need further clarifications.



[GitHub] [hudi] GurRonenExplorium commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

GurRonenExplorium commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-661909397


   For additional context, here is the Hudi configuration:
   ```
   // Imports assumed for this snippet (Hudi 0.5.x package layout):
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig}
   import org.apache.hudi.hive.MultiPartKeysValueExtractor

   val hudiOptions = Map[String, String](
     HoodieWriteConfig.TABLE_NAME -> tableName,
     // Record key, partition path, and pre-combine field for the writer
     DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> uuidColumn,
     DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> partitionColumn,
     DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> precombineField,
     DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
     DataSourceWriteOptions.OPERATION_OPT_KEY -> ingestMode.getOrElse(DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL),
     // Hive sync settings (this is the path that fails against Glue)
     DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
     DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> databaseName,
     DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
     DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> partitionColumn,
     DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
     DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
     // Storage and compaction tuning
     HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE -> String.valueOf(5000000),
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS -> String.valueOf(true),
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE -> String.valueOf(200)
   )
   ```
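   For completeness, a minimal usage sketch (not from the thread) showing how a map like this is typically handed to the writer; `df` and `basePath` stand in for the app's input DataFrame and S3 target path:
   ```
   import org.apache.spark.sql.SaveMode

   // Hedged sketch: pass the options map to the Hudi datasource and write to the base path.
   df.write
     .format("org.apache.hudi")
     .options(hudiOptions)
     .mode(SaveMode.Append)
     .save(basePath)
   ```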



[GitHub] [hudi] WTa-hash commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

WTa-hash commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-662436124


   @GurRonenExplorium - a JIRA issue exists for this problem: https://issues.apache.org/jira/browse/HUDI-874



[GitHub] [hudi] bvaradar commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

bvaradar commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-662544225


   cc @umehrot2 



[GitHub] [hudi] GurRonenExplorium commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

GurRonenExplorium commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-662442723


   Thanks!



[GitHub] [hudi] bvaradar closed issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

bvaradar closed issue #1856:
URL: https://github.com/apache/hudi/issues/1856


   



[GitHub] [hudi] GurRonenExplorium edited a comment on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

GurRonenExplorium edited a comment on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-661909397


   For additional context, here is the Hudi configuration:
   ```
   // Same configuration as in the earlier comment, with the write operation pinned to "insert":
   val hudiOptions = Map[String, String](
     HoodieWriteConfig.TABLE_NAME -> tableName,
     DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> uuidColumn,
     DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> partitionColumn,
     DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> precombineField,
     DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
     DataSourceWriteOptions.OPERATION_OPT_KEY -> "insert",
     DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
     DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> databaseName,
     DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> tableName,
     DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> partitionColumn,
     DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
     DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
     HoodieStorageConfig.PARQUET_COMPRESSION_CODEC -> "snappy",
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE -> String.valueOf(5000000),
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS -> String.valueOf(true),
     HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE -> String.valueOf(200)
   )
   ```



[GitHub] [hudi] umehrot2 commented on issue #1856: [SUPPORT] HiveSyncTool fails on alter table cascade

umehrot2 commented on issue #1856:
URL: https://github.com/apache/hudi/issues/1856#issuecomment-662694529


   @GurRonenExplorium @bvaradar The EMR team is aware of this issue when working with the Glue metastore. We have fixed it; however, the fix will only ship in future EMR releases and is not available in any existing release.
   
   For now, if you want schema evolution with Hudi on EMR, I would suggest using Hive as the metastore instead of Glue; a sketch follows.
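   One illustrative way to do that (not from the thread, and assuming your HiveServer2 is configured against a real Hive metastore rather than the Glue client factory) is to override the sync JDBC endpoint. The host below is a placeholder, and these option keys map to `hoodie.datasource.hive_sync.jdbc_url`, `hoodie.datasource.hive_sync.username`, and `hoodie.datasource.hive_sync.password`:
   ```
   // Hedged sketch: reuse the options map from earlier, overriding the Hive sync endpoint so
   // DDL runs against a HiveServer2 backed by a real Hive metastore, not the Glue catalog.
   val hiveBackedSyncOptions = hudiOptions ++ Map(
     DataSourceWriteOptions.HIVE_URL_OPT_KEY -> "jdbc:hive2://hive-metastore-host:10000",  // placeholder host
     DataSourceWriteOptions.HIVE_USER_OPT_KEY -> "hive",
     DataSourceWriteOptions.HIVE_PASS_OPT_KEY -> ""
   )
   ```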
   
   

