Posted to commits@hudi.apache.org by "Ethan Guo (Jira)" <ji...@apache.org> on 2022/02/08 03:49:00 UTC

[jira] [Updated] (HUDI-3375) Investigate deltastreamer continuous mode getting stuck when metadata table is enabled

     [ https://issues.apache.org/jira/browse/HUDI-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3375:
----------------------------
    Sprint: Hudi-Sprint-Jan-31

> Investigate deltastreamer continuous mode getting stuck when metadata table is enabled
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-3375
>                 URL: https://issues.apache.org/jira/browse/HUDI-3375
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Deltastreamer in continuous mode, writing a MOR table with upserts, with async compaction, clustering, and cleaning, and with archival and the metadata table enabled:
> {code:java}
> /Users/ethan/Work/lib/spark-3.2.0-bin-hadoop3.2/bin/spark-submit \
>       --master local[3] \
>       --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1 \
>       --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>       --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
>       --conf spark.sql.catalogImplementation=hive \
>       --conf spark.driver.maxResultSize=1g \
>       --conf spark.speculation=true \
>       --conf spark.speculation.multiplier=1.0 \
>       --conf spark.speculation.quantile=0.5 \
>       --packages org.apache.spark:spark-avro_2.12:3.2.0 \
>       --jars /Users/ethan/Work/repo/hudi-benchmarks/target/hudi-benchmarks-0.1-SNAPSHOT.jar \
>       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>       /Users/ethan/Work/lib/hudi-utilities-bundle_2.12-0.11.0-SNAPSHOT-no-error-inj.jar \
>       --props /Users/ethan/Work/scripts/metadata_test_ds_mor_continuous.properties \
>       --source-class BenchmarkDataSource \
>       --source-ordering-field ts \
>       --target-base-path /Users/ethan/Work/data/hudi/metadata_test_ds_mor_continuous_1 \
>       --target-table metadata_test_ds_mor_continuous_table_1 \
>       --table-type MERGE_ON_READ \
>       --op UPSERT \
>       --continuous >> metadata_test_ds_mor_continuous_1_output.log 2>&1 {code}
> metadata_test_ds_mor_continuous.properties:
> {code:java}
> hoodie.upsert.shuffle.parallelism=40
> hoodie.insert.shuffle.parallelism=40
> hoodie.delete.shuffle.parallelism=40
> hoodie.bulkinsert.shuffle.parallelism=40
> # Key fields, for kafka example
> hoodie.datasource.write.recordkey.field=key
> hoodie.datasource.write.partitionpath.field=partition
> # Schema provider props (change to absolute path based on your installation)
> hoodie.deltastreamer.schemaprovider.source.schema.file=file:/Users/ethan/Work/scripts/benchmark_schema.avsc
> hoodie.deltastreamer.schemaprovider.target.schema.file=file:/Users/ethan/Work/scripts/benchmark_schema.avsc
> # DFS Source
> hoodie.deltastreamer.source.dfs.root=file:/Users/ethan/Work/data/hudi/benchmark_sample_upserts2
> benchmark.input.source.path=file:/Users/ethan/Work/data/hudi/benchmark_sample_upserts2
> # Clustering
> hoodie.clustering.async.enabled=true
> hoodie.clustering.async.max.commits=6
> # Compaction
> hoodie.compact.inline.max.delta.commits=3
> # Clean and archive
> hoodie.clean.async=true
> hoodie.keep.max.commits=7
> hoodie.keep.min.commits=5
> hoodie.cleaner.commits.retained=4
> # Concurrency control
> hoodie.write.concurrency.mode=optimistic_concurrency_control
> hoodie.cleaner.policy.failed.writes=LAZY
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
> # Metadata table
> hoodie.metadata.compact.max.delta.commits=5
> hoodie.metadata.keep.min.commits=8
> hoodie.metadata.keep.max.commits=12 {code}
> The deltastreamer cannot proceed further after around 50 commits.
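When reproducing this setup, it may help to first rule out a misconfigured retention window. The following is a minimal sketch, not part of the original report, that parses the properties above and checks the ordering Hudi is generally understood to expect: hoodie.keep.max.commits > hoodie.keep.min.commits > hoodie.cleaner.commits.retained, and likewise max > min for the metadata table. The helper names (parse_properties, check_retention) are hypothetical, and the ordering rule is an assumption to verify against the Hudi version in use.

```python
# Hypothetical helper: sanity-check the retention settings from the report.
# The ordering rule below is an assumption, not taken from the issue itself.

SAMPLE = """\
# Clean and archive
hoodie.keep.max.commits=7
hoodie.keep.min.commits=5
hoodie.cleaner.commits.retained=4
# Metadata table
hoodie.metadata.keep.min.commits=8
hoodie.metadata.keep.max.commits=12
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_retention(props):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    retained = int(props["hoodie.cleaner.commits.retained"])
    keep_min = int(props["hoodie.keep.min.commits"])
    keep_max = int(props["hoodie.keep.max.commits"])
    if keep_min <= retained:
        problems.append("hoodie.keep.min.commits should exceed "
                        "hoodie.cleaner.commits.retained")
    if keep_max <= keep_min:
        problems.append("hoodie.keep.max.commits should exceed "
                        "hoodie.keep.min.commits")
    # Metadata table keys are optional in this sketch.
    md_min = props.get("hoodie.metadata.keep.min.commits")
    md_max = props.get("hoodie.metadata.keep.max.commits")
    if md_min is not None and md_max is not None and int(md_max) <= int(md_min):
        problems.append("hoodie.metadata.keep.max.commits should exceed "
                        "hoodie.metadata.keep.min.commits")
    return problems


props = parse_properties(SAMPLE)
print(check_retention(props))
```

Under these assumptions the reported values (4 < 5 < 7 for the data table, 8 < 12 for the metadata table) pass the check, which suggests the stall is not simply a retention-ordering mistake.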



--
This message was sent by Atlassian Jira
(v8.20.1#820001)