Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/11/26 10:12:00 UTC
[jira] [Work logged] (HIVE-25740) Handle race condition between compaction txn abort/commit and heartbeater

     [ https://issues.apache.org/jira/browse/HIVE-25740?focusedWorklogId=686791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-686791 ]

ASF GitHub Bot logged work on HIVE-25740:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Nov/21 10:11
            Start Date: 26/Nov/21 10:11
    Worklog Time Spent: 10m 
      Work Description: marton-bod commented on a change in pull request #2817:
URL: https://github.com/apache/hive/pull/2817#discussion_r757361747



##########
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##########
@@ -748,6 +736,7 @@ void wasSuccessful() {
      * @throws Exception
      */
     @Override public void close() throws Exception {
+      shutdownHeartbeater();

Review comment:
       > "Theoretically this has the same issue as before the patch, just the other way around. We stop the heartbeat, the transaction times out, and we try to commit / abort."
   
   That's true, this was my first thought as well. However, I think shutting down the heartbeater should be really fast and should not cause problems in a healthy system. If the heartbeat task is waiting to be scheduled by the executor (which is the case most of the time), it will be shut down immediately. Otherwise it will send one more heartbeat, but that heartbeat would need to take minutes (on the order of `hive.txn.timeout`) to cause any problems. If a single heartbeat takes that long, we have other issues in the system anyway.
   
   > "How complicated would it be to turn off exception handling in the heart beater instead first, and stop it after abort / commit?"
   
   Can you elaborate on what you mean by turning off exception handling? I think the general problem would remain: we commit/abort the txn and then signal the heartbeater thread to stop whatever it is currently doing, but if it has already called `msc.heartbeat()` by that point (or is just about to call it), there is not much we can do, and that heartbeat will fail.
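
To make the ordering discussed above concrete, here is a minimal sketch of the "stop the heartbeater first, then commit/abort" approach. It is not the code from this PR: `FakeTxnStore` and `CompactionTxn` are hypothetical stand-ins for the metastore and the worker's txn wrapper, and the 5 second wait is an arbitrary illustrative value. It only assumes the heartbeat runs as a periodic task on a `ScheduledExecutorService`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; none of these classes exist in Hive.
final class HeartbeatShutdownSketch {

  /** Hypothetical in-memory stand-in for the backend DB holding one open txn. */
  static final class FakeTxnStore {
    final AtomicBoolean open = new AtomicBoolean(true);
    final AtomicInteger heartbeats = new AtomicInteger();
    final AtomicInteger lateHeartbeats = new AtomicInteger();

    void heartbeat() {
      heartbeats.incrementAndGet();
      if (!open.get()) {
        // In the real system this is where NoSuchTxnException would surface.
        lateHeartbeats.incrementAndGet();
      }
    }

    void commit() {
      open.set(false);
    }
  }

  /** Hypothetical txn wrapper whose close() stops heartbeating before committing. */
  static final class CompactionTxn implements AutoCloseable {
    private final FakeTxnStore store;
    private final ScheduledExecutorService heartbeater =
        Executors.newSingleThreadScheduledExecutor();

    CompactionTxn(FakeTxnStore store, long intervalMs) {
      this.store = store;
      heartbeater.scheduleAtFixedRate(store::heartbeat, 0, intervalMs, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() throws InterruptedException {
      // shutdown() cancels the pending periodic runs; a heartbeat that is
      // already executing is allowed to finish.
      heartbeater.shutdown();
      // Wait for that in-flight heartbeat (if any) before touching the txn.
      if (!heartbeater.awaitTermination(5, TimeUnit.SECONDS)) {
        heartbeater.shutdownNow(); // last resort: interrupt a stuck heartbeat
      }
      // Only now end the txn, so no heartbeat can observe it as missing.
      store.commit();
    }
  }

  public static void main(String[] args) throws Exception {
    FakeTxnStore store = new FakeTxnStore();
    try (CompactionTxn txn = new CompactionTxn(store, 5)) {
      Thread.sleep(100); // simulate compaction work while heartbeats are sent
    }
    System.out.println("heartbeats sent: " + store.heartbeats.get());
    System.out.println("heartbeats after commit: " + store.lateHeartbeats.get());
  }
}
```

The residual risk is the one raised in the quoted question: between stopping the heartbeater and committing, the txn receives no heartbeats, so the gap has to stay well below `hive.txn.timeout`.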






Issue Time Tracking
-------------------

            Worklog Id:     (was: 686791)
    Remaining Estimate: 0h
            Time Spent: 10m

> Handle race condition between compaction txn abort/commit and heartbeater
> -------------------------------------------------------------------------
>
>                 Key: HIVE-25740
>                 URL: https://issues.apache.org/jira/browse/HIVE-25740
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marton Bod
>            Assignee: Marton Bod
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The issue is the following: once the compaction worker finishes, commitTxn/abortTxn is invoked first, and the heartbeater thread is only interrupted after that. This can lead to a race condition in which the txn has already been deleted from the backend DB via commit/abort, but the concurrently running heartbeater thread still attempts to send one last heartbeat. The txn id is then not found in the DB, leading to {{NoSuchTxnException}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)