Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/12 02:21:35 UTC

[GitHub] [hudi] maikouliujian opened a new issue, #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

maikouliujian opened a new issue, #7653:
URL: https://github.com/apache/hudi/issues/7653

   **Describe the problem you faced**
   When I write a Hudi COW table to AWS S3 concurrently via the Spark API, the exception java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes is thrown. I am using Hudi's ZookeeperBasedLockProvider for optimistic concurrency control (OCC).
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run Hudi 0.11.0 and write concurrently to AWS S3.
   2. Use the configuration below:
   ```
      resultDF.write.format("hudi")
         .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "uq_id,_track_id,event,_flush_time")
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "process_time")
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "p_day,p_hour,p_region,p_type")
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(), classOf[ComplexKeyGenerator].getName)
         .option(HoodieWriteConfig.TBL_NAME.key(), sinkHudiTable)
         .option(DataSourceWriteOptions.HIVE_URL.key(), "")
         .option(DataSourceWriteOptions.HIVE_DATABASE.key(), "default")
         .option(DataSourceWriteOptions.HIVE_TABLE.key(), sinkHudiTable)
         .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key(), "p_day,p_hour,p_region,p_type")
         .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key(), classOf[MultiPartKeysValueExtractor].getName)
         .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED.key(), "true")
         .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
         .option(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.GLOBAL_BLOOM.name())
         .option(HoodieCompactionConfig.CLEANER_POLICY.key(), HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name)
         .option(HoodieCompactionConfig.ASYNC_CLEAN.key(), "true")
         .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key(), "240")
         .option(HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key(), "250")
         .option(HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key(), "260")
         .option(HoodieCompactionConfig.FAILED_WRITES_CLEANER_POLICY.key(), HoodieFailedWritesCleaningPolicy.LAZY.name())
         .option(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key(), WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name())
         .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
         .option("hoodie.write.lock.zookeeper.url", "")
         .option("hoodie.write.lock.zookeeper.port", "2181")
         .option("hoodie.write.lock.zookeeper.lock_key", sinkHudiTable)
         .option("hoodie.write.lock.zookeeper.base_path", "/hudi_multiwriter")
         .option("hoodie.insert.shuffle.parallelism", "2")
         .option("hoodie.upsert.shuffle.parallelism", "2")
         .mode(SaveMode.Append)
         .save(s3outPath)
   ```
   **Expected behavior**
   
    Concurrent writes should commit successfully; instead, the job fails with:

    Exception in thread "main" org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.2.0
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   The AWS EMR version is 6.7.0.
   
   **Stacktrace**
   
   ```
   Exception in thread "main" org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
    	at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102)
    	at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
    	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
    	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
    	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
    	at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:79)
    	at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:475)
    	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:233)
    	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:122)
    	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:678)
    	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:313)
    	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:165)
    	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
    	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
    	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
    	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
    	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
    	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
    	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
    	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
    	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
    	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
    	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
    	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:303)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    	at EtlHudi2hudi.writeData(EtlHudi2hudi.scala:113)
    	at EtlHudi2hudi.run(EtlHudi2hudi.scala:53)
    	at EtlHudi2hudi$.main(EtlHudi2hudi.scala:32)
    	at EtlHudi2hudi.main(EtlHudi2hudi.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
    	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
    	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
    	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
    	... 64 more
   ```
   




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "SamarthRaval (via GitHub)" <gi...@apache.org>.
SamarthRaval commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1765267904

   @Jason-liujc 
   Can we just increase yarn.resourcemanager.am.max-attempts to a higher number, so the Hudi job retries if it fails with java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes?




[GitHub] [hudi] fengjian428 commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by GitBox <gi...@apache.org>.
fengjian428 commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1383615588

   This is expected behavior when multiple writers write records into the same file group. What behavior do you want? @maikouliujian
   




[GitHub] [hudi] nsivabalan commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1396360092

   How are you writing to Hudi? Can you give us a reproducible script? Is it Spark datasource, Spark Streaming, DeltaStreamer, or Spark SQL? At least from spark-shell, when you are using the Spark datasource writer, we know the command fails on conflicts.
   




[GitHub] [hudi] maikouliujian commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by GitBox <gi...@apache.org>.
maikouliujian commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1379998635

   > Can your job recover automatically from the failure?
   
   In my case, the exception is not the root exception, so when it happens my job keeps running and I cannot catch the job failure. Do you have any other suggestions? Thanks very much.




[GitHub] [hudi] Jason-liujc commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "Jason-liujc (via GitHub)" <gi...@apache.org>.
Jason-liujc commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1687260803

   We are encountering the same issue. After using DynamoDB as the lock table, we still see this error: `java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes`
   
   What I observed:
   1. I have 4 EMR Spark clusters that write to the same table. One by one, they fail with the above error. When I look at the DynamoDB lock history, I see locks constantly getting created and released.
   2. The DynamoDB lock is not at the file level but at the table level, so two Hudi jobs might try to write to the same files and one of them fails. If a couple of concurrent jobs run at the same time writing to the same files, they can go into a failure storm that fails everything unless you set a very high retry threshold.
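
   A minimal sketch of the application-level retry this implies, assuming the conflict surfaces as org.apache.hudi.exception.HoodieWriteConflictException (the exception in the stacktrace above); writeWithRetry, maxAttempts, and the backoff policy are illustrative, not Hudi APIs:
   
   ```
   // Sketch: retry a Hudi datasource write on commit conflicts, with linear backoff.
   // HoodieWriteConflictException comes from the stacktrace above; everything else
   // here is an assumption for illustration.
   import org.apache.hudi.exception.HoodieWriteConflictException
   
   def writeWithRetry(maxAttempts: Int, backoffMs: Long)(write: => Unit): Unit = {
     var attempt = 1
     var succeeded = false
     while (!succeeded) {
       try {
         write            // e.g. resultDF.write.format("hudi")...save(s3outPath)
         succeeded = true
       } catch {
         case e: HoodieWriteConflictException if attempt < maxAttempts =>
           Thread.sleep(backoffMs * attempt)  // back off a little longer each attempt
           attempt += 1
       }
     }
   }
   ```
   
   Whether this helps depends on the overlap pattern: if every attempt touches the same file groups, retries only delay the failure storm.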




[GitHub] [hudi] danny0405 commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1379955435

   Can your job recover automatically from the failure?




[GitHub] [hudi] tomyanth commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "tomyanth (via GitHub)" <gi...@apache.org>.
tomyanth commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1538080521

   I have the same issue when running multi-writer in a local console as well.




[GitHub] [hudi] maikouliujian commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "maikouliujian (via GitHub)" <gi...@apache.org>.
maikouliujian commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1415695253

   > I tried multi-writers from two different spark-shells, and one of them fails while writing to hudi.
   >
   > [...]
   >
   > If you can provide us with a reproducible script, that would be nice; as of now, it's not reproducible on our end.
   
   Does Hudi 0.11.0 not support multiple writers via the Spark datasource?




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "Jason-liujc (via GitHub)" <gi...@apache.org>.
Jason-liujc commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1766886802

   Can't speak to what the official guidance from Hudi is at the moment (it seems they will roll out the non-blocking concurrent write feature in version 1.0+).
   
   We had to increase `yarn.resourcemanager.am.max-attempts` and `spark.yarn.maxAppAttempts` (the Spark-specific config) to make it retry more, and reorganize our tables to reduce concurrent writes. Any other lock provider wasn't an option for us, since we are running different jobs from different clusters.
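
   For reference, a hedged example of passing the Spark-side setting at submit time; the cluster-wide ceiling, yarn.resourcemanager.am.max-attempts, is set in yarn-site.xml and caps whatever value is used here, and the class and jar names below are placeholders:
   
   ```
   # spark.yarn.maxAppAttempts is capped by yarn.resourcemanager.am.max-attempts
   # (a YARN ResourceManager setting configured in yarn-site.xml).
   spark-submit \
     --master yarn \
     --conf spark.yarn.maxAppAttempts=5 \
     --class EtlHudi2hudi \
     etl-hudi-job.jar
   ```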




[GitHub] [hudi] tomyanth commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "tomyanth (via GitHub)" <gi...@apache.org>.
tomyanth commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1538082248

   ![image](https://user-images.githubusercontent.com/111942577/236793809-019d5279-a570-4cb3-bcd5-939c755092bd.png)
   ![image](https://user-images.githubusercontent.com/111942577/236793920-834e373e-60e1-4f69-8868-ac5662a15826.png)
   




[GitHub] [hudi] maikouliujian commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by GitBox <gi...@apache.org>.
maikouliujian commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1383689373

   
   
   
   > This is expected behavior when multiple writers write records into the same file group. What behavior do you want? @maikouliujian
   
   In my case, when this exception happens my job does not fail but keeps running; the job never finishes. So how can I know whether the job ran correctly?




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "Jason-liujc (via GitHub)" <gi...@apache.org>.
Jason-liujc commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1752131875

   The main thing we did was to change our Hudi table structure to avoid concurrent writes to the same partition as much as possible (batching workloads together, sequencing the jobs, etc.).
   
   For us, the DynamoDB lock provider wasn't able to do any write retries, so it just fails the Spark job. We increased the YARN and Spark retry settings to retry automatically from the cluster side.
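
   For completeness, a hedged sketch of Hudi's lock-acquisition retry settings; note that these retry acquiring the lock itself, not a commit that has already failed conflict resolution, which is consistent with the behavior described above (df and targetPath are placeholders):
   
   ```
   // These knobs retry lock acquisition only; a commit that fails conflict
   // resolution still surfaces HoodieWriteConflictException to the caller.
   df.write.format("hudi").
     option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
     option("hoodie.cleaner.policy.failed.writes", "LAZY").
     option("hoodie.write.lock.num_retries", "15").
     option("hoodie.write.lock.wait_time_ms_between_retry", "5000").
     // ...plus the provider-specific lock settings shown earlier in this thread...
     mode("append").
     save(targetPath)
   ```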




[GitHub] [hudi] tomyanth commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "tomyanth (via GitHub)" <gi...@apache.org>.
tomyanth commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1538089746

   When I run the above code in two separate notebooks to simulate the multi-writer process, the error occurs, just like what @maikouliujian faced.




[GitHub] [hudi] nsivabalan commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1407256375

   @maikouliujian: any updates, please?




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "subash-metica (via GitHub)" <gi...@apache.org>.
subash-metica commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1824457795

   I had a similar issue when trying to perform clustering (as a separate process) while streaming to the table at the same time. Even after providing a lock provider (Zookeeper) running on the same cluster, why does this behaviour happen?




[GitHub] [hudi] nsivabalan commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1400639443

   Looks like it's a Spark datasource write. How do you claim that the job does not fail? Are you executing it from spark-shell, and the command to write to Hudi is just stuck?
   




[GitHub] [hudi] soumilshah1995 commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1594545704

   ```
   DYNAMODB_LOCK_TABLE_NAME = 'hudi-lock-table'
   curr_session = boto3.session.Session()
   curr_region = curr_session.region_name
   
   
   def upsert_hudi_table(glue_database, table_name,
                         record_id, precomb_key, table_type, spark_df,
                         enable_partition, enable_cleaner, enable_hive_sync, enable_dynamodb_lock,
                         use_sql_transformer, sql_transformer_query,
                         target_path, index_type, method='upsert'):
       """
       Upserts a dataframe into a Hudi table.
   
       Args:
           glue_database (str): The name of the glue database.
           table_name (str): The name of the Hudi table.
           record_id (str): The name of the field in the dataframe that will be used as the record key.
           precomb_key (str): The name of the field in the dataframe that will be used for pre-combine.
           table_type (str): The Hudi table type (e.g., COPY_ON_WRITE, MERGE_ON_READ).
           spark_df (pyspark.sql.DataFrame): The dataframe to upsert.
           enable_partition (bool): Whether or not to enable partitioning.
           enable_cleaner (bool): Whether or not to enable data cleaning.
           enable_hive_sync (bool): Whether or not to enable syncing with Hive.
           use_sql_transformer (bool): Whether or not to use SQL to transform the dataframe before upserting.
           sql_transformer_query (str): The SQL query to use for data transformation.
           target_path (str): The path to the target Hudi table.
           method (str): The Hudi write method to use (default is 'upsert').
           index_type : BLOOM or GLOBAL_BLOOM
       Returns:
           None
       """
       # These are the basic settings for the Hoodie table
       hudi_final_settings = {
           "hoodie.table.name": table_name,
           "hoodie.datasource.write.table.type": table_type,
           "hoodie.datasource.write.operation": method,
           "hoodie.datasource.write.recordkey.field": record_id,
           "hoodie.datasource.write.precombine.field": precomb_key,
       }
   
       # These settings enable syncing with Hive
       hudi_hive_sync_settings = {
           "hoodie.parquet.compression.codec": "gzip",
           "hoodie.datasource.hive_sync.enable": "true",
           "hoodie.datasource.hive_sync.database": glue_database,
           "hoodie.datasource.hive_sync.table": table_name,
           "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
           "hoodie.datasource.hive_sync.use_jdbc": "false",
           "hoodie.datasource.hive_sync.mode": "hms",
       }
   
       # These settings enable automatic cleaning of old data
       hudi_cleaner_options = {
           "hoodie.clean.automatic": "true",
           "hoodie.clean.async": "true",
           "hoodie.cleaner.policy": 'KEEP_LATEST_FILE_VERSIONS',
           "hoodie.cleaner.fileversions.retained": "3",
           "hoodie-conf hoodie.cleaner.parallelism": '200',
           'hoodie.cleaner.commits.retained': 5
       }
   
       # These settings enable partitioning of the data
       partition_settings = {
           "hoodie.datasource.write.partitionpath.field": args['PARTITON_FIELDS'],
           "hoodie.datasource.hive_sync.partition_fields": args['PARTITON_FIELDS'],
           "hoodie.datasource.write.hive_style_partitioning": "true",
       }
   
       # Define a dictionary with the index settings for Hudi
       hudi_index_settings = {
           "hoodie.index.type": index_type,  # Specify the index type for Hudi
       }
   
       hudi_dynamo_db_based_lock = {
           'hoodie.write.concurrency.mode': 'optimistic_concurrency_control'
           , 'hoodie.cleaner.policy.failed.writes': 'LAZY'
           , 'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider'
           , 'hoodie.write.lock.dynamodb.table': DYNAMODB_LOCK_TABLE_NAME
           , 'hoodie.write.lock.dynamodb.partition_key': 'tablename'
           , 'hoodie.write.lock.dynamodb.region': '{0}'.format(curr_region)
           , 'hoodie.write.lock.dynamodb.endpoint_url': 'dynamodb.{0}.amazonaws.com'.format(curr_region)
           , 'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST'
   
       }
   
       hudi_file_size = {
           "hoodie.parquet.max.file.size": 512 * 1024 * 1024,  # 512MB
           "hoodie.parquet.small.file.limit": 104857600,  # 100MB
       }
   
       # Add the Hudi index settings to the final settings dictionary
       for key, value in hudi_index_settings.items():
           hudi_final_settings[key] = value  # Add the key-value pair to the final settings dictionary
   
        for key, value in hudi_file_size.items():
            hudi_final_settings[key] = value
    
        # If partitioning is enabled, add the partition settings to the final settings
        if enable_partition == "True" or enable_partition == "true" or enable_partition == True:
            for key, value in partition_settings.items(): hudi_final_settings[key] = value
   
       # if DynamoDB based lock enabled use dynamodb as lock table
       if enable_dynamodb_lock == "True" or enable_dynamodb_lock == "true" or enable_dynamodb_lock == True:
           for key, value in hudi_dynamo_db_based_lock.items(): hudi_final_settings[key] = value
   
       # If data cleaning is enabled, add the cleaner options to the final settings
       if enable_cleaner == "True" or enable_cleaner == "true" or enable_cleaner == True:
           for key, value in hudi_cleaner_options.items(): hudi_final_settings[key] = value
   
       # If Hive syncing is enabled, add the Hive sync settings to the final settings
       if enable_hive_sync == "True" or enable_hive_sync == "true" or enable_hive_sync == True:
           for key, value in hudi_hive_sync_settings.items(): hudi_final_settings[key] = value
   
       # If there is data to write, apply any SQL transformations and write to the target path
       if spark_df.count() > 0:
   
           if use_sql_transformer == "True" or use_sql_transformer == "true" or use_sql_transformer == True:
               spark_df.createOrReplaceTempView("temp")
               spark_df = spark.sql(sql_transformer_query)
   
            # Replace null values in all columns with the default value 'n/a'
            default_value = 'n/a'
            spark_df = spark_df.na.fill(default_value)
   
           print("**************************************************************")
           print(spark_df.show())
           print("**************************************************************")
   
           spark_df.write.format("hudi"). \
               options(**hudi_final_settings). \
               mode("append"). \
               save(target_path)
   
   ```
   
   You can use DynamoDB lock tables:
   
   ```
   
   DYNAMODB_LOCK_TABLE_NAME = 'hudi-lock-table'
   curr_session = boto3.session.Session()
   curr_region = curr_session.region_name
   ```
   
   ```
   
       hudi_dynamo_db_based_lock = {
           'hoodie.write.concurrency.mode': 'optimistic_concurrency_control'
           , 'hoodie.cleaner.policy.failed.writes': 'LAZY'
           , 'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider'
           , 'hoodie.write.lock.dynamodb.table': DYNAMODB_LOCK_TABLE_NAME
           , 'hoodie.write.lock.dynamodb.partition_key': 'tablename'
           , 'hoodie.write.lock.dynamodb.region': '{0}'.format(curr_region)
           , 'hoodie.write.lock.dynamodb.endpoint_url': 'dynamodb.{0}.amazonaws.com'.format(curr_region)
           , 'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST'
   
       }
   ```




[GitHub] [hudi] soumilshah1995 commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1591005626

   Try using DynamoDB as the lock table.
   




[GitHub] [hudi] tomyanth commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "tomyanth (via GitHub)" <gi...@apache.org>.
tomyanth commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1538086919

   
   """
   
   Install
   https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop2.tgz
   
   hadoop2.7
   https://github.com/soumilshah1995/winutils/blob/master/hadoop-2.7.7/bin/winutils.exe
   
   pyspark --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   VAR
   SPARK_HOME
   HADOOP_HOME
   
   
   PATH
    `%HADOOP_HOME%\bin`
   `%SPARK_HOME%\bin`
   
   Complete Tutorials on HUDI
   https://github.com/soumilshah1995/Insert-Update-Read-Write-SnapShot-Time-Travel-incremental-Query-on-APache-Hudi-transacti/blob/main/hudi%20(1).ipynb
   """
   
   
   
   import os
   import sys
   import uuid
   
   import pyspark
   from pyspark.sql import SparkSession
   from pyspark import SparkConf, SparkContext
   from pyspark.sql.functions import col, asc, desc
   from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
   from pyspark.sql.functions import *
   from pyspark.sql.types import *
    from functools import reduce
    from faker import Faker
   
   import findspark
   
   import datetime 
   
   time = datetime.datetime.now()
   time = time.strftime("YMD%Y%m%dHHMMSSms%H%M%S%f")
   
   
   SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1 pyspark-shell"
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
   
   findspark.init()
   
   spark = SparkSession.builder\
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('className', 'org.apache.hudi') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
       .config('spark.sql.warehouse.dir', 'file:///C:/tmp/spark_warehouse') \
       .getOrCreate()
   
   
   global faker
   
   faker = Faker()
   
   
   class DataGenerator(object):
   
       @staticmethod
       def get_data():
           return [
               (
                   x,
                   faker.name(),
                   faker.random_element(elements=('IT', 'HR', 'Sales', 'Marketing')),
                   faker.random_element(elements=('CA', 'NY', 'TX', 'FL', 'IL', 'RJ')),
                   faker.random_int(min=10000, max=150000),
                   faker.random_int(min=18, max=60),
                   faker.random_int(min=0, max=100000),
                   faker.unix_time()
               ) for x in range(5)
           ]
   
   
   data = DataGenerator.get_data()
   
   columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
   spark_df = spark.createDataFrame(data=data, schema=columns)
   print(spark_df.show())
   
   db_name = "hudidb"
   table_name = "hudi_table"
   recordkey = 'emp_id'
   precombine = 'ts'
   
   path = "file:///C:/tmp/spark_warehouse"
   method = 'upsert'
   table_type = "COPY_ON_WRITE"
   
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'emp_id',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'ts',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2,
       'hoodie.schema.on.read.enable' : 'true', # for changing column names
   
       'hoodie.write.concurrency.mode':'optimistic_concurrency_control',     #added for zookeeper to deal with multiple source writes
       'hoodie.cleaner.policy.failed.writes':'LAZY',
       # 'hoodie.write.lock.provider':'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',
   
       'hoodie.write.lock.provider':'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
       'hoodie.write.lock.zookeeper.url':'localhost',
       'hoodie.write.lock.zookeeper.port':'2181',
       'hoodie.write.lock.zookeeper.lock_key':'my_lock',
       'hoodie.write.lock.zookeeper.base_path':'/hudi_locks',
   
   }
   
   print("*"*55)
   print("over-write")
   print("*"*55)
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("overwrite"). \
       save(path)
   
   print("*"*55)
   print("READ")
   print("*"*55)
   read_df = spark.read. \
       format("hudi"). \
       load(path)
   print(read_df.show())
   
   impleDataUpd = [
       (6, "This is APPEND4", "Sales", "RJ", 81000, 30, 23000, 827307999),
       (7, "This is APPEND4", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
   ]
   
   columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]
   usr_up_df = spark.createDataFrame(data=impleDataUpd, schema=columns)
   
   
   usr_up_df.write.format("hudi").options(**hudi_options).mode("append").save(path)
   
   print("*"*55)
   print("READ")
   print("*"*55)
   read_df = spark.read. \
       format("hudi"). \
       load(path)
    print(read_df.show())
    ```




[GitHub] [hudi] nsivabalan commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1400757517

   I tried multi-writers from two different spark-shells, and one of them fails while writing to hudi.
   ```
   
   
   scala> df2.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   option("hoodie.write.concurrency.mode","optimistic_concurrency_control").
        |   option("hoodie.cleaner.policy.failed.writes","LAZY").
        |   option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
        |   option("hoodie.write.lock.zookeeper.url","localhost:2181").
        |   option("hoodie.write.lock.zookeeper.port","2181").
        |   option("hoodie.write.lock.zookeeper.lock_key","locks").
        |   option("hoodie.write.lock.zookeeper.base_path","/tmp/locks/.lock").
        |   mode(Append).
        |   save(basePath)
   warning: there was one deprecation warning; re-run with -deprecation for details
   [Stage 14:>                                                         (0 + 3) / 3]# WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
   23/01/23 10:00:20 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
   org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
     at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102)
     at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
     at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
     at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
     at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
     at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
     at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:79)
     at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:491)
     at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:234)
     at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:126)
     at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:698)
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:343)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
     ... 75 elided
   Caused by: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
     ... 109 more
   
   scala> 
   
   ```
   The write to Hudi fails and the next command prompt is seen.
   
   Excerpt from my other shell, which succeeded:
   ```
   scala> df2.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   option("hoodie.write.concurrency.mode","optimistic_concurrency_control").
        |   option("hoodie.cleaner.policy.failed.writes","LAZY").
        |   option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
        |   option("hoodie.write.lock.zookeeper.url","localhost:2181").
        |   option("hoodie.write.lock.zookeeper.port","2181").
        |   option("hoodie.write.lock.zookeeper.lock_key","locks").
        |   option("hoodie.write.lock.zookeeper.base_path","/tmp/locks/.lock").
        |   mode(Append).
        |   save(basePath)
   warning: one deprecation; for details, enable `:setting -deprecation' or `:replay -deprecation'
   # WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
   23/01/23 10:00:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
                                                                                   
   scala> 
   ```
   
   If you can provide us with a reproducible script, that would be nice; as of now, it's not reproducible on our end.
   




[GitHub] [hudi] maikouliujian commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "maikouliujian (via GitHub)" <gi...@apache.org>.
maikouliujian commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1415690889

   > Looks like it's a Spark datasource write. How do you claim that the job does not fail? Are you executing it from spark-shell, and the command to write to Hudi is just stuck?
   
   In my case, my job is scheduled hourly by Azkaban. When multiple jobs run, the Azkaban job does not fail but keeps running. In the detailed log I can see the exception. Why does my Azkaban job not fail?




[GitHub] [hudi] maikouliujian commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "maikouliujian (via GitHub)" <gi...@apache.org>.
maikouliujian commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1415678881

   > How are you writing to Hudi? Can you give us a reproducible script? Is it Spark datasource, Spark Streaming, DeltaStreamer, or Spark SQL? At least from spark-shell, when you are using the Spark datasource writer, we know the command fails on conflicts.
   
   In my case, I run my job with the Spark datasource writer.




[GitHub] [hudi] ad1happy2go commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1549798063

   @tomyanth I tried running with version 0.13.0 and I didn't have any issues like the spark-shell job getting stuck. Can you try 0.13.0 and see if you still face the issue?




[GitHub] [hudi] tomyanth commented on issue #7653: [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes

Posted by "tomyanth (via GitHub)" <gi...@apache.org>.
tomyanth commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1538084069

   C:\Users\User\Desktop\hudi\TestAppend copy.ipynb




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "SamarthRaval (via GitHub)" <gi...@apache.org>.
SamarthRaval commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1751779534

   Hello @Jason-liujc @maikouliujian
   
   I am seeing the exact same error, and I am also using the DynamoDB lock the same way as in the last comment. Were you able to figure out a workaround, or anything else to fix this issue? I'm facing something very similar.




Re: [I] [SUPPORT]java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes [hudi]

Posted by "SamarthRaval (via GitHub)" <gi...@apache.org>.
SamarthRaval commented on issue #7653:
URL: https://github.com/apache/hudi/issues/7653#issuecomment-1765269597

   I heard that the DynamoDB lock provider doesn't work with retries, but Zookeeper does? If anyone has knowledge about this, would you mind sharing it here?
   

