You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/06/02 16:54:43 UTC

[GitHub] [kafka] hachikuji opened a new pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

hachikuji opened a new pull request #8782:
URL: https://github.com/apache/kafka/pull/8782


   The method `maybeWriteTxnCompletion` is unsafe for concurrent calls. This can cause duplicate attempts to write the completion record to the log, which can ultimately lead to illegal state errors and possible to correctness violations if another transaction had been started before the duplicate was written. This patch fixes the problem by ensuring only one thread can successfully remove the pending completion from the map.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#issuecomment-637951275


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#issuecomment-637951668


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434098545



##########
File path: core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala
##########
@@ -285,15 +280,16 @@ class TransactionMarkerChannelManager(config: KafkaConfig,
     }
   }
 
-  private def maybeWriteTxnCompletion(transactionalId: String): Unit = {
-    Option(transactionsWithPendingMarkers.get(transactionalId)).foreach { pendingCommitTxn =>

Review comment:
       Multiple threads may see the transaction still as pending and attempt completion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#issuecomment-638228671


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434153210



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -221,7 +221,6 @@ class TransactionMarkerChannelManagerTest {
     EasyMock.replay(metadataCache)
 
     channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata1, txnMetadata1.prepareComplete(time.milliseconds()))
-    channelManager.addTxnMarkersToSend(coordinatorEpoch, txnResult, txnMetadata2, txnMetadata2.prepareComplete(time.milliseconds()))

Review comment:
       The change to use `mock` instead of `niceMock` led to a failure because of an unexpected append to the log. It seemed like this call was not necessary to test the behavior we were interested in here, so I removed it rather than adding the expected call to append.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434306556



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -86,6 +90,70 @@ class TransactionMarkerChannelManagerTest {
       .anyTimes()
   }
 
+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {

Review comment:
       It fails deterministically without the fix.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434218326



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -86,6 +90,70 @@ class TransactionMarkerChannelManagerTest {
       .anyTimes()
   }
 
+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {

Review comment:
       It does. I was trying to setup this test to fit how we're likely hitting this in practice. In the call to `addTxnMarkersToSend`, before calling `maybeWriteTxnCompletion`, we have to acquire the lock. It is possible that the caller fails to acquire the lock before the markers finish getting written and the transaction gets completed in the request completion handler.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] guozhangwang commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
guozhangwang commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434173517



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -291,7 +358,7 @@ class TransactionMarkerChannelManagerTest {
 
     val response = new WriteTxnMarkersResponse(createPidErrorMap(Errors.NONE))
     for (requestAndHandler <- requestAndHandlers) {
-      requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.PRODUCE, 0, "client", 1),
+      requestAndHandler.handler.onComplete(new ClientResponse(new RequestHeader(ApiKeys.WRITE_TXN_MARKERS, 0, "client", 1),

Review comment:
       Nice catch. Not sure why they did not fail before :)

##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -86,6 +90,70 @@ class TransactionMarkerChannelManagerTest {
       .anyTimes()
   }
 
+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {

Review comment:
       Does this test cover concurrent calls to `maybeWriteTxnCompletion`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#issuecomment-637951729


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] guozhangwang commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
guozhangwang commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434688366



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -86,6 +90,70 @@ class TransactionMarkerChannelManagerTest {
       .anyTimes()
   }
 
+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {

Review comment:
       Ah yes, I missed the `txnMetadata2.lock.lock()` before starting the scheduler, thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] guozhangwang commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
guozhangwang commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434258952



##########
File path: core/src/test/scala/unit/kafka/coordinator/transaction/TransactionMarkerChannelManagerTest.scala
##########
@@ -86,6 +90,70 @@ class TransactionMarkerChannelManagerTest {
       .anyTimes()
   }
 
+  @Test
+  def shouldOnlyWriteTxnCompletionOnce(): Unit = {

Review comment:
       Got it, so when the bug still exist this test would probably not fail consistently, but would be flaky, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #8782:
URL: https://github.com/apache/kafka/pull/8782#discussion_r434038469



##########
File path: core/src/main/scala/kafka/coordinator/transaction/TransactionMarkerChannelManager.scala
##########
@@ -215,8 +215,6 @@ class TransactionMarkerChannelManager(config: KafkaConfig,
   }
 
   private def writeTxnCompletion(pendingCommitTxn: PendingCompleteTxn): Unit = {

Review comment:
       How about renaming ```pendingCommitTxn``` to ```pendingCompleteTxn```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji merged pull request #8782: KAFKA-10080; Fix race condition on txn completion which can cause duplicate appends

Posted by GitBox <gi...@apache.org>.
hachikuji merged pull request #8782:
URL: https://github.com/apache/kafka/pull/8782


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org