
Posted to issues@hive.apache.org by "Sourabh Badhya (Jira)" <ji...@apache.org> on 2023/05/11 13:46:00 UTC

[jira] [Created] (HIVE-27332) Add retry backoff mechanism for abort cleanup

Sourabh Badhya created HIVE-27332:
-------------------------------------

             Summary: Add retry backoff mechanism for abort cleanup
                 Key: HIVE-27332
                 URL: https://issues.apache.org/jira/browse/HIVE-27332
             Project: Hive
          Issue Type: Sub-task
            Reporter: Sourabh Badhya
            Assignee: Sourabh Badhya


HIVE-27019 and HIVE-27020 added the ability to clean data directories of aborted transactions directly, without using the Initiator & Worker. However, when cleanup fails continuously, a retry is initiated every single time. We need to add a retry backoff mechanism that controls how long to wait before initiating the next retry, instead of retrying continuously.
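
As an illustration of what a backoff could look like, here is a minimal sketch that doubles the wait after each failed attempt and caps it at a maximum. The class name, the method names and the doubling policy are all assumptions made for this example; the issue itself does not specify the backoff formula.

```java
import java.time.Duration;

/** Hypothetical helper; not part of the Hive code base. */
final class AbortCleanupBackoff {
    private AbortCleanupBackoff() {}

    /**
     * Returns the wait before the next retry: base * 2^(failedAttempts - 1),
     * capped at max. Returns zero when nothing has failed yet.
     */
    static Duration nextDelay(Duration base, Duration max, int failedAttempts) {
        if (failedAttempts < 1) {
            return Duration.ZERO;
        }
        // Bound the shift so 1L << shift cannot overflow by itself.
        int shift = Math.min(failedAttempts - 1, 30);
        long millis;
        try {
            millis = Math.multiplyExact(base.toMillis(), 1L << shift);
        } catch (ArithmeticException overflow) {
            // The product overflowed; the cap below will apply.
            millis = Long.MAX_VALUE;
        }
        return Duration.ofMillis(Math.min(millis, max.toMillis()));
    }

    /** True once the delay has fully elapsed since the last failure. */
    static boolean isRetryDue(long lastFailureMillis, Duration delay, long nowMillis) {
        return nowMillis - lastFailureMillis >= delay.toMillis();
    }
}
```

With a base of 5 minutes and a cap of 1 hour, the third consecutive failure would wait 20 minutes, and any later failure would wait at most 1 hour.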

Broadly, there are 3 cases in which retry of abort cleanup is affected - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed.*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We create a new table called TXN_CLEANUP_QUEUE with the following fields to store the retry metadata -* 
CREATE TABLE TXN_CLEANUP_QUEUE (
  TCQ_DATABASE varchar(128) NOT NULL,
  TCQ_TABLE varchar(256) NOT NULL,
  TCQ_PARTITION varchar(767),
  TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0,
  TCQ_ERROR_MESSAGE mediumtext -- mediumtext in MySQL; clob in Derby and Oracle; text in Postgres; varchar(max) in MSSQL
);

*Advantage: Separates the flow of abort cleanup metadata from compaction metadata. This eliminates the chance of breaking compaction or abort cleanup when modifying the metadata of the other, and makes debugging easier in case of failures.*

*Actions performed by TaskHandler in the case of failure -* 

*AbortTxnCleaner -* 
Action: Add the retry details to the queue table when abort cleanup fails.
*CompactionCleaner -* 
Action: If compaction on the same table is successful, delete the retry entry in markCleaned when removing any TXN_COMPONENTS entries, except when there are no uncompacted aborts. This avoids a situation where there is a queue entry for a table but no record in TXN_COMPONENTS associated with the same table.

*Advantage: No performance issues are expected with this approach, since most of the time we delete only 1 record for the associated table/partition.*
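
The two handler actions above can be sketched with an in-memory stand-in for TXN_CLEANUP_QUEUE. This is only a rough model under stated assumptions: every class and method name here is hypothetical, the map replaces the real metastore table, and the "no uncompacted aborts" exception is not modelled.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for TXN_CLEANUP_QUEUE; all names are hypothetical. */
final class TxnCleanupQueueSketch {

    /** One queue row: the retry retention and the last error message. */
    static final class RetryEntry {
        final long retryRetentionMillis;
        final String errorMessage;

        RetryEntry(long retryRetentionMillis, String errorMessage) {
            this.retryRetentionMillis = retryRetentionMillis;
            this.errorMessage = errorMessage;
        }
    }

    // Keyed by (database, table, partition); partition may be null.
    private final Map<List<String>, RetryEntry> rows = new HashMap<>();

    private static List<String> key(String db, String table, String partition) {
        return Arrays.asList(db, table, partition);
    }

    /** AbortTxnCleaner path: record the retry details when abort cleanup fails. */
    void onAbortCleanupFailure(String db, String table, String partition,
                               long retryRetentionMillis, Throwable cause) {
        rows.put(key(db, table, partition),
                 new RetryEntry(retryRetentionMillis, String.valueOf(cause.getMessage())));
    }

    /**
     * CompactionCleaner path: after a successful compaction, drop the retry
     * entry for the same table/partition. Returns true if a row was deleted.
     */
    boolean onCompactionCleaned(String db, String table, String partition) {
        return rows.remove(key(db, table, partition)) != null;
    }

    /** Returns the queue row for the given table/partition, or null if absent. */
    RetryEntry lookup(String db, String table, String partition) {
        return rows.get(key(db, table, partition));
    }
}
```

In this model a successful compaction removes at most one row per table/partition, which matches the expectation that the delete is cheap.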



--
This message was sent by Atlassian Jira
(v8.20.10#820010)