Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/19 03:09:33 UTC

[GitHub] [hudi] hbgstc123 opened a new issue, #6711: [SUPPORT]Unable to acquire lock when parallelism grows

hbgstc123 opened a new issue, #6711:
URL: https://github.com/apache/hudi/issues/6711

   I have a situation where a Hudi table needs 2 writers: one is a Flink job writing new data, the other deletes old data.
   
   In the Flink job, I use optimistic concurrency control (OCC):
   ```
   'hoodie.write.concurrency.mode'='optimistic_concurrency_control',
   'hoodie.cleaner.policy.failed.writes'='LAZY',
   'hoodie.write.lock.provider'='org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider',
   'hoodie.write.lock.hivemetastore.database'='db1',
   'hoodie.write.lock.hivemetastore.table'='table1'
   ```
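   For context, these options sit in the WITH clause of the Flink SQL table definition. A minimal sketch of a full definition (the table name, columns, and path here are illustrative, not the real job's):
   ```sql
   -- Illustrative table; only the WITH options mirror the config above.
   CREATE TABLE hudi_sink (
     id BIGINT,
     ts TIMESTAMP(3),
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 'hdfs:///tmp/hudi/table1',  -- illustrative path
     'table.type' = 'MERGE_ON_READ',
     'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',
     'hoodie.cleaner.policy.failed.writes' = 'LAZY',
     'hoodie.write.lock.provider' = 'org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider',
     'hoodie.write.lock.hivemetastore.database' = 'db1',
     'hoodie.write.lock.hivemetastore.table' = 'table1'
   );
   ```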
   When run with 4 upsert writers, it runs normally.
   But when raised to 64 upsert writers, checkpoints start to fail occasionally with
   `Caused by: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null`
   
   Steps to reproduce the behavior:
   
   1. Write to a Hudi table with Flink.
   2. Set the concurrency mode to OCC and use the HMS lock provider.
   3. Set the writer parallelism to 64 or higher.
   4. Some checkpoints then fail with HoodieLockException.
   
   **Expected behavior**
   
   no error
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Flink version : 1.13
   
   
   **Stacktrace**
   
   ```
   java.io.IOException: Could not perform checkpoint 2 for operator bucket_write: table1 (32/32)#0.
       at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1048)
       at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:135)
       at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:249)
       at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:61)
       at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:435)
       at org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:61)
       at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:226)
       at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:180)
       at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:158)
       at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
       at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
       at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
       at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:782)
       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
       at java.lang.Thread.run(Thread.java:748) 
   Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2 for operator bucket_write: table1 (32/32)#0. Failure reason: Checkpoint was declined.
       at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:264)
       at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
       at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
       at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
       at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
       at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
       at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
       at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
       at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1032) ... 19 more 
   Caused by: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
       at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:82)
       at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
       at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1458)
       at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1493)
       at org.apache.hudi.client.HoodieFlinkWriteClient.upsert(HoodieFlinkWriteClient.java:138)
       at org.apache.hudi.sink.StreamWriteFunction.lambda$initWriteFunction$1(StreamWriteFunction.java:187)
       at org.apache.hudi.sink.StreamWriteFunction.lambda$flushRemaining$7(StreamWriteFunction.java:466)
       at java.util.LinkedHashMap$LinkedValues.forEach(LinkedHashMap.java:608)
       at org.apache.hudi.sink.StreamWriteFunction.flushRemaining(StreamWriteFunction.java:458)
       at org.apache.hudi.sink.StreamWriteFunction.snapshotState(StreamWriteFunction.java:134)
       at org.apache.hudi.sink.bucket.BucketStreamWriteFunction.snapshotState(BucketStreamWriteFunction.java:100)
       at org.apache.hudi.sink.common.AbstractStreamWriteFunction.snapshotState(AbstractStreamWriteFunction.java:157)
       at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
       at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
       at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:89)
       at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:218) ... 29 more
   ```
   
   I added some logging and found that the exception is thrown because StreamWriteFunction fails to acquire the lock within `maxRetries` attempts, which is 10 as configured by `hoodie.write.lock.client.num_retries`.
   Do we have a good solution for this situation, such as not locking the table when a writer tries to flush data?
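   For reference, the retry budget can be widened through the lock configs; a sketch with loosened values (the numbers are illustrative and need tuning per setup; `hoodie.write.lock.client.num_retries` is the one that caps `maxRetries` above):
   ```
   'hoodie.write.lock.client.num_retries'='50',
   'hoodie.write.lock.client.wait_time_ms_between_retry'='5000',
   'hoodie.write.lock.wait_time_ms'='60000',
   'hoodie.write.lock.wait_time_ms_between_retry'='2000'
   ```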
   
   
   




[GitHub] [hudi] danny0405 commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1369405532

   Flink OCC is not supported yet.




[GitHub] [hudi] yihua commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1254251993

   @danny0405 do you know if this is an issue specific to Hudi Flink integration?




[GitHub] [hudi] gtwuser commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
gtwuser commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1381452564

   @nsivabalan @yihua and @danny0405 We are facing the same issue as mentioned above with AWS Glue, using the locking configs below and just 2 writers. In fact, we observed the errors below even with a single writer.

   Kindly provide some pointers 🙏
   
   
   Locking configs:
   ```
               'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
               'hoodie.cleaner.policy.failed.writes': 'LAZY',
               'hoodie.bulkinsert.shuffle.parallelism': 2000,
               'hoodie.write.lock.hivemetastore.database': database,
               'hoodie.write.lock.hivemetastore.table': table_name,
               'hoodie.write.lock.provider': 'org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider',
               'hoodie.write.lock.client.wait_time_ms_between_retry': 50000,
               'hoodie.write.lock.wait_time_ms_between_retry': 20000,
               'hoodie.write.lock.wait_time_ms': 60000,
               'hoodie.write.lock.client.num_retries': 15
   ```
   Error found:
   ```
   Caused by: org.apache.hudi.exception.HoodieException: Unable to acquire lock, lock object null
   ```
   
   Full stack trace:
   ```
   2023-01-13 06:24:18,784,784 ERROR    [cxdl4-ldf-kkj-sbx.py:310] error ingesting data: An error occurred while calling o174.save.
   : org.apache.spark.SparkException: Writing job failed.
   	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobFailedError(QueryExecutionErrors.scala:742)
   	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:404)
   	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:353)
   	at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:244)
   	at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:332)
   	at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:331)
   	at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:244)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
   	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:103)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:96)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:615)
   	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:615)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:591)
   	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:96)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:83)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:81)
   	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:124)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:311)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
   	at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:592)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:180)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:144)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:103)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:114)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:139)
   	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:224)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:139)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:245)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:138)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:100)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:96)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:615)
   	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:177)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:615)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:591)
   	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:96)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:83)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:81)
   	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:124)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to acquire lock, lock object null
   	at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:87)
   	at org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite.commit(HoodieDataSourceInternalBatchWrite.java:93)
   	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:392)
   	... 88 more
   	Suppressed: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
   		at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:84)
   		at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
   		at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1452)
   		at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1487)
   		at org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:729)
   		at org.apache.hudi.internal.DataSourceInternalWriterHelper.abort(DataSourceInternalWriterHelper.java:95)
   		at org.apache.hudi.spark3.internal.HoodieDataSourceInternalBatchWrite.abort(HoodieDataSourceInternalBatchWrite.java:98)
   		at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:399)
   		... 88 more
   	Caused by: org.apache.hudi.exception.HoodieLockException: FAILED_TO_ACQUIRE lock at database cxdl4_sbx_ldf_ing_temp_usw2 and table swsc_cx_classic_va_access
   		at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.tryLock(HiveMetastoreBasedLockProvider.java:115)
   		at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:73)
   		... 95 more
   	Caused by: java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException: lock is not supported
   		at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   		at java.util.concurrent.FutureTask.get(FutureTask.java:206)
   		at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.acquireLockInternal(HiveMetastoreBasedLockProvider.java:187)
   		at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.acquireLock(HiveMetastoreBasedLockProvider.java:140)
   		at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.tryLock(HiveMetastoreBasedLockProvider.java:113)
   		... 96 more
   	Caused by: java.lang.UnsupportedOperationException: lock is not supported
   		at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.lock(GlueMetastoreClientDelegate.java:1808)
   		at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.lock(AWSCatalogMetastoreClient.java:1302)
   		at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.lambda$acquireLockInternal$0(HiveMetastoreBasedLockProvider.java:186)
   		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   		... 1 more
   Caused by: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object null
   	at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:84)
   	at org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
   	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:231)
   	at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:215)
   	at org.apache.hudi.internal.DataSourceInternalWriterHelper.commit(DataSourceInternalWriterHelper.java:84)
   	... 90 more
   Caused by: org.apache.hudi.exception.HoodieLockException: FAILED_TO_ACQUIRE lock at database cxdl4_sbx_ldf_ing_temp_usw2 and table swsc_cx_classic_va_access
   	at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.tryLock(HiveMetastoreBasedLockProvider.java:115)
   	at org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:73)
   	... 94 more
   Caused by: java.util.concurrent.ExecutionException: java.lang.UnsupportedOperationException: lock is not supported
   	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
   	at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.acquireLockInternal(HiveMetastoreBasedLockProvider.java:187)
   	at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.acquireLock(HiveMetastoreBasedLockProvider.java:140)
   	at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.tryLock(HiveMetastoreBasedLockProvider.java:113)
   	... 95 more
   Caused by: java.lang.UnsupportedOperationException: lock is not supported
   	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.lock(GlueMetastoreClientDelegate.java:1808)
   	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.lock(AWSCatalogMetastoreClient.java:1302)
   	at org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider.lambda$acquireLockInternal$0(HiveMetastoreBasedLockProvider.java:186)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	... 1 more
   
   ```
   
   




[GitHub] [hudi] danny0405 commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1381524577

   There is a similar issue: https://github.com/apache/hudi/issues/7665, nice ping cc @umehrot2 




[GitHub] [hudi] nsivabalan commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1400635017

   This is a known gap in Glue w.r.t. supporting Hive-based locks; please reach out to AWS Glue for support.
   Thanks! Feel free to re-open if you think the root cause is something else.
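   For anyone hitting this on Glue: the root cause in the trace above is `java.lang.UnsupportedOperationException: lock is not supported`, thrown by the Glue metastore client, so the HMS-based lock provider cannot work against the Glue catalog. A commonly used alternative is Hudi's DynamoDB-based lock provider from the hudi-aws module; a sketch (the table name and region are illustrative, and the writer needs DynamoDB permissions):
   ```python
   # Sketch: swap the HMS lock provider for the DynamoDB-based one on Glue.
   lock_options = {
       'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
       'hoodie.cleaner.policy.failed.writes': 'LAZY',
       'hoodie.write.lock.provider': 'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
       'hoodie.write.lock.dynamodb.table': 'hudi-locks',         # illustrative table name
       'hoodie.write.lock.dynamodb.partition_key': 'tablename',  # one lock row per Hudi table
       'hoodie.write.lock.dynamodb.region': 'us-west-2',         # illustrative region
       'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST',
   }
   ```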




[GitHub] [hudi] nsivabalan closed issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows
URL: https://github.com/apache/hudi/issues/6711




[GitHub] [hudi] nsivabalan commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1287909526

   With 64 writers, the contention is definitely going to be huge. With the metadata table, you need a lock provider as well. I don't think we can do much here in my understanding; the only viable option I can think of is to increase the lock timeouts and retries (along the lines of the config sketch in the issue description above).
   But I will let Danny and others speak for Flink from their experience.
   




[GitHub] [hudi] kapjoshi-cisco commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by GitBox <gi...@apache.org>.
kapjoshi-cisco commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1381563852

   > There is a similar issue: #7665, nice ping cc @umehrot2
   
   I added that issue in case this one was missed; hope that's fine.




[GitHub] [hudi] nsivabalan commented on issue #6711: [SUPPORT]Unable to acquire lock when parallelism grows

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on issue #6711:
URL: https://github.com/apache/hudi/issues/6711#issuecomment-1400757064

   I tried multi-writer from two different spark-shells, and one of them fails while writing to Hudi.
   ```
   scala> df2.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   option("hoodie.write.concurrency.mode","optimistic_concurrency_control").
        |   option("hoodie.cleaner.policy.failed.writes","LAZY").
        |   option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
        |   option("hoodie.write.lock.zookeeper.url","localhost:2181").
        |   option("hoodie.write.lock.zookeeper.port","2181").
        |   option("hoodie.write.lock.zookeeper.lock_key","locks").
        |   option("hoodie.write.lock.zookeeper.base_path","/tmp/locks/.lock").
        |   mode(Append).
        |   save(basePath)
   warning: there was one deprecation warning; re-run with -deprecation for details
   [Stage 14:>                                                         (0 + 3) / 3]# WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
   23/01/23 10:00:20 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
   org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
     at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102)
     at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
     at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
     at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
     at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
     at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
     at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:79)
     at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:491)
     at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:234)
     at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:126)
     at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:698)
     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:343)
     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
     ... 75 elided
   Caused by: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
     ... 109 more
   
   scala> 
   
   ```
   The write to Hudi fails and the next command prompt is shown; with OCC this is the expected outcome for the losing writer (see the retry sketch at the end of this comment).
   
   Excerpt from my other shell, which succeeded:
   ```
   scala> df2.write.format("hudi").
        |   options(getQuickstartWriteConfigs).
        |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
        |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
        |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
        |   option(TABLE_NAME, tableName).
        |   option("hoodie.write.concurrency.mode","optimistic_concurrency_control").
        |   option("hoodie.cleaner.policy.failed.writes","LAZY").
        |   option("hoodie.write.lock.provider","org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
        |   option("hoodie.write.lock.zookeeper.url","localhost:2181").
        |   option("hoodie.write.lock.zookeeper.port","2181").
        |   option("hoodie.write.lock.zookeeper.lock_key","locks").
        |   option("hoodie.write.lock.zookeeper.base_path","/tmp/locks/.lock").
        |   mode(Append).
        |   save(basePath)
   warning: one deprecation; for details, enable `:setting -deprecation' or `:replay -deprecation'
   # WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
   23/01/23 10:00:19 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
                                                                                   
   scala> 
   ```
   
   It would be nice if you could provide us with a reproducible script; as of now, it is not reproducible on our end.
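   As an aside, the losing writer under OCC is expected to fail like this, and callers typically retry. A minimal sketch of such a retry loop (the helper is hypothetical; it assumes a `writeOpts` map carrying the same options as in the shells above):
   ```scala
   import org.apache.hudi.exception.HoodieWriteConflictException
   import org.apache.spark.sql.{DataFrame, SaveMode}

   // Hypothetical helper: re-attempt an OCC write that loses the conflict check.
   def writeWithRetry(df: DataFrame, writeOpts: Map[String, String],
                      basePath: String, maxAttempts: Int = 3): Unit = {
     var attempt = 1
     var done = false
     while (!done) {
       try {
         df.write.format("hudi")
           .options(writeOpts)   // same OCC + ZooKeeper lock options as above
           .mode(SaveMode.Append)
           .save(basePath)
         done = true
       } catch {
         // The conflicting commit won; a retry re-reads the latest timeline.
         case _: HoodieWriteConflictException if attempt < maxAttempts =>
           attempt += 1
       }
     }
   }
   ```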
   

