Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2022/01/13 22:25:00 UTC
[jira] [Closed] (HUDI-2943) Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering

     [ https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan closed HUDI-2943.
-------------------------------------
    Resolution: Fixed

> Deltastreamer fails to continue with pending clustering after restart in 0.10.0 and inline clustering
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2943
>                 URL: https://issues.apache.org/jira/browse/HUDI-2943
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Harsha Teja Kanna
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: core-flow-ds, pull-request-available, sev:high
>             Fix For: 0.10.1
>
>         Attachments: image-2021-12-08-15-10-02-420.png
>
>
> With inline clustering enabled, Deltastreamer fails to restart when a clustering commit from a previous run is still pending; the upsert fails with the HoodieUpsertException shown below.
> Note: as a workaround, running the clustering job with --retry-last-failed-clustering-job works.
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
> diagnostics: User class threw exception: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20211206081248919
> at org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
> at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
> at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
> at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
> at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
> at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
> at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
> at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not allowed to update the clustering file group HoodieFileGroupId{partitionPath='', fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering operations, we are not going to support update for now.
> at org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clustering.plan.strategy.sort.columns=projectid
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
> hoodie.clean.async=true
> hoodie.clean.automatic=true
> hoodie.cleaner.policy=KEEP_LATEST_COMMITS
> hoodie.cleaner.commits.retained=10
> hoodie.deltastreamer.transformer.sql=SELECT id, sid FROM <SRC> a
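
The failure described above depends on a clustering instant from a previous run still being pending on the timeline. A minimal sketch of how one might spot such instants from a listing of the table's .hoodie directory is given here. It assumes the 0.10.x timeline naming, in which clustering is recorded as a replacecommit action with files named <instant>.replacecommit.requested, <instant>.replacecommit.inflight and <instant>.replacecommit (completed). The function name and the sample file names are hypothetical and do not come from the issue. Other operations such as insert_overwrite also write replacecommit instants, so a hit is only a hint that clustering may be pending.

```python
import re
from typing import Iterable, List

# Assumed timeline file naming (Hudi 0.10.x):
#   <instant>.replacecommit.requested
#   <instant>.replacecommit.inflight
#   <instant>.replacecommit            (completed)
_REPLACE_FILE = re.compile(r"^(\d+)\.replacecommit(?:\.(requested|inflight))?$")


def pending_replace_instants(file_names: Iterable[str]) -> List[str]:
    """Return instant times that were requested or inflight but never completed.

    `file_names` is a plain listing of the .hoodie directory; how the listing
    is obtained (local disk, S3, ...) is deliberately left out of this sketch.
    """
    started = set()
    completed = set()
    for name in file_names:
        match = _REPLACE_FILE.match(name)
        if match is None:
            continue  # not a replacecommit file (e.g. a regular .commit)
        instant, state = match.group(1), match.group(2)
        if state is None:
            completed.add(instant)
        else:
            started.add(instant)
    return sorted(started - completed)


if __name__ == "__main__":
    # Hypothetical listing, for illustration only.
    listing = [
        "20211205101010101.replacecommit.requested",
        "20211205101010101.replacecommit.inflight",
        "20211205101010101.replacecommit",
        "20211206075501234.replacecommit.requested",
        "20211206075501234.replacecommit.inflight",
        "20211206080000000.commit",
    ]
    print(pending_replace_instants(listing))
```

If the function returns a non-empty list for the affected table, that would be consistent with the HoodieClusteringUpdateException in the stack trace, where the upsert touches a file group that belongs to a pending clustering plan.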



--
This message was sent by Atlassian Jira
(v8.20.1#820001)