Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/23 20:54:12 UTC

[GitHub] [hudi] nsivabalan opened a new issue #4889: [SUPPORT] Clustering fails w/ metadata table

nsivabalan opened a new issue #4889:
URL: https://github.com/apache/hudi/issues/4889


   **Describe the problem you faced**
   
   Ran a DeltaStreamer job with inline clustering and the metadata table enabled, on latest master. Clustering failed near completion, while building the column stats index for the metadata table.
   
   This is actually an integration test suite job that failed.
   
   property file:
   https://github.com/apache/hudi/pull/4884/files#diff-63e53dde6161fec3341cf85817ff0d8f2b0d33d75803da8818afaf3bb544722e
   
   yaml file:
   https://github.com/apache/hudi/blob/master/docker/demo/config/test-suite/deltastreamer-long-running-multi-partitions.yaml
   
   It is strange, given that I did not enable column stats and just left everything at the default settings.
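   
   For reference, the clustering and metadata knobs in play look roughly like this (illustrative values only; the exact settings are in the property file linked above):
   
   ```
   # Standard Hudi config keys; values here are illustrative,
   # not copied from the linked property file.
   hoodie.metadata.enable=true
   hoodie.clustering.inline=true
   hoodie.clustering.inline.max.commits=4
   ```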
   
   
   
   **Environment Description**
   
   * Hudi version : 0.11.0-SNAPSHOT
   
   * Spark version : 2.4.5
   
   * Hive version :
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : 
   
   * Running on Docker? (yes/no) : yes
   
   
   
   **Stacktrace**
   
   ```
   .
   .
   03:47  WARN: Timeline-server-based markers are configured as the marker type but embedded timeline server is not enabled.  Falling back to direct markers.
   22/02/23 16:22:11 ERROR DagScheduler: Exception executing node
   org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394)
   at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473)
   at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360)
   at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194)
   at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502)
   at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74)
   at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267)
   at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at scala.Option.getOrElse(Option.scala:121)
   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
   at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177)
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386)
   ... 19 more
   22/02/23 16:22:11 INFO DagScheduler: Forcing shutdown of executor service, this might kill running tasks
   22/02/23 16:22:11 ERROR HoodieTestSuiteJob: Failed to run Test Suite
   java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:845)
   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at org.apache.spark.deploy.SparkSubmit$anon$2.doSubmit(SparkSubmit.scala:920)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:146)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394)
   at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473)
   at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360)
   at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194)
   at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502)
   at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74)
   at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267)
   at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
   ... 6 more
   Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at scala.Option.getOrElse(Option.scala:121)
   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
   at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177)
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386)
   ... 19 more
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Failed to run Test Suite
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:208)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:170)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:845)
   at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at org.apache.spark.deploy.SparkSubmit$anon$2.doSubmit(SparkSubmit.scala:920)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.execute(DagScheduler.java:113)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.schedule(DagScheduler.java:68)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.runTestSuite(HoodieTestSuiteJob.java:203)
   ... 13 more
   Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:146)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieClusteringException: unable to transition clustering inflight to complete: 20220223162156842
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:394)
   at org.apache.hudi.client.SparkRDDWriteClient.completeTableService(SparkRDDWriteClient.java:473)
   at org.apache.hudi.client.SparkRDDWriteClient.cluster(SparkRDDWriteClient.java:360)
   at org.apache.hudi.client.BaseHoodieWriteClient.lambda$inlineClustering$15(BaseHoodieWriteClient.java:1196)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at org.apache.hudi.client.BaseHoodieWriteClient.inlineClustering(BaseHoodieWriteClient.java:1194)
   at org.apache.hudi.client.BaseHoodieWriteClient.runTableServicesInline(BaseHoodieWriteClient.java:502)
   at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:211)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:124)
   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:74)
   at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:173)
   at org.apache.hudi.integ.testsuite.HoodieTestSuiteWriter.commit(HoodieTestSuiteWriter.java:267)
   at org.apache.hudi.integ.testsuite.dag.nodes.InsertNode.execute(InsertNode.java:54)
   at org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
   ... 6 more
   Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at org.apache.spark.sql.execution.datasources.DataSource$anonfun$7.apply(DataSource.scala:185)
   at scala.Option.getOrElse(Option.scala:121)
   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
   at org.apache.hudi.index.columnstats.ColumnStatsIndexHelper.updateColumnStatsIndexFor(ColumnStatsIndexHelper.java:315)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateColumnsStatsIndex(HoodieSparkCopyOnWriteTable.java:219)
   at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.updateMetadataIndexes(HoodieSparkCopyOnWriteTable.java:177)
   at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:386)
   ... 19 more
   ```
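   
   For what it's worth, Spark raises "Unable to infer schema for Parquet. It must be specified manually." when `spark.read().parquet(path)` is pointed at a path with no Parquet files to infer a schema from, which suggests ColumnStatsIndexHelper is reading an empty (or not-yet-written) index path here. A minimal Java sketch of that failure mode, assuming a local Spark session and a hypothetical empty directory:
   
   ```java
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import org.apache.spark.sql.SparkSession;
   
   public class InferSchemaRepro {
     public static void main(String[] args) throws Exception {
       SparkSession spark = SparkSession.builder()
           .master("local[1]")
           .appName("infer-schema-repro")
           .getOrCreate();
       // Hypothetical path for illustration: an existing directory that
       // contains no Parquet files, so there is no footer to infer from.
       Files.createDirectories(Paths.get("/tmp/empty-parquet-dir"));
       try {
         spark.read().parquet("/tmp/empty-parquet-dir").show();
       } catch (Exception e) {
         // org.apache.spark.sql.AnalysisException:
         // Unable to infer schema for Parquet. It must be specified manually.
         System.out.println("Caught: " + e);
       } finally {
         spark.stop();
       }
     }
   }
   ```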
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yihua commented on issue #4889: [SUPPORT] Clustering fails w/ latest master with col stats partition even after disabling metadata table

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #4889:
URL: https://github.com/apache/hudi/issues/4889#issuecomment-1050446353


   I'm going to put up a fix for this.  Tracked in https://issues.apache.org/jira/browse/HUDI-3513.





[GitHub] [hudi] nsivabalan closed issue #4889: [SUPPORT] Clustering fails w/ latest master with col stats partition even after disabling metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4889:
URL: https://github.com/apache/hudi/issues/4889


   





[GitHub] [hudi] nsivabalan commented on issue #4889: [SUPPORT] Clustering fails w/ latest master with col stats partition even after disabling metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4889:
URL: https://github.com/apache/hudi/issues/4889#issuecomment-1050811079


   Closing the GitHub issue as we have a tracking JIRA (HUDI-3513).





[GitHub] [hudi] yihua commented on issue #4889: [SUPPORT] Clustering fails w/ latest master with col stats partition even after disabling metadata table

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #4889:
URL: https://github.com/apache/hudi/issues/4889#issuecomment-1050522871


   The method `table.updateMetadataIndexes()` actually updates the column stats index in the `.hoodie` folder, not the column stats index in the metadata table.  @alexeykudinkin could you take a look at this exception?





[GitHub] [hudi] nsivabalan commented on issue #4889: [SUPPORT] Clustering fails w/ latest master with col stats partition even after disabling metadata table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4889:
URL: https://github.com/apache/hudi/issues/4889#issuecomment-1050384844


   Excerpt from SparkRDDWriteClient; the last line here should be called only when the metadata table is enabled.
   ```
   // Update table's metadata (table)
   updateTableMetadata(table, metadata, clusteringInstant);
   // Update table's metadata indexes
   // NOTE: This overlaps w/ metadata table (above) and will be reconciled in the future
   table.updateMetadataIndexes(context, writeStats, clusteringCommitTime);
   ```
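   
   A minimal sketch of the suggested guard, assuming a config accessor along the lines of `config.isMetadataTableEnabled()` (the exact accessor name in HoodieWriteConfig may differ):
   
   ```java
   updateTableMetadata(table, metadata, clusteringInstant);
   // Gate the index update on the metadata table actually being enabled;
   // the accessor name here is assumed for illustration.
   if (config.isMetadataTableEnabled()) {
     table.updateMetadataIndexes(context, writeStats, clusteringCommitTime);
   }
   ```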
   
   Also, please verify whether we update metadata indexes for compaction; I did not see explicit calls in the completeCompaction() method. And ensure the call happens only when the metadata table is enabled.
   
   
   CC @codope @yihua 
   
   

