Posted to commits@hudi.apache.org by "Balaji Varadarajan (Jira)" <ji...@apache.org> on 2019/11/11 17:27:00 UTC

[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

    [ https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971747#comment-16971747 ] 

Balaji Varadarajan commented on HUDI-309:
-----------------------------------------

[https://github.com/apache/incubator-hudi/blob/23b303e4b17c5f7b603900ee5b0d2e6718118014/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L860]
{code:java}
    if (!table.getActiveTimeline().getCleanerTimeline().empty()) {
        logger.info("Cleaning up older rollback meta files");
        // Cleanup of older cleaner meta files
        // TODO - make the commit archival generic and archive rollback metadata
        FSUtils.deleteOlderRollbackMetaFiles(fs, table.getMetaClient().getMetaPath(),
                    table.getActiveTimeline().getRollbackTimeline().getInstants());
    }
{code}
 

As part of PR-942, the code above was removed because it is handled elsewhere. Just noting that we need to ensure cleaner commits are also handled correctly during archiving.

> General Redesign of Archived Timeline for efficient scan and management
> -----------------------------------------------------------------------
>
>                 Key: HUDI-309
>                 URL: https://issues.apache.org/jira/browse/HUDI-309
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Balaji Varadarajan
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.5.1
>
>         Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provide more safety by always serving committed data, independent of when the corresponding commit action was attempted. Currently, we implicitly assume a data file is valid if its commit time is older than the earliest time in the active timeline. While this works, any bug in rollback could inadvertently expose a possibly duplicate file once its commit timestamp becomes older than that of every commit in the active timeline.
>  # We have had to deal with a lot of corner cases because of the way we treat a "commit" as special after it gets archived. Examples include the savepoint handling logic in the cleaner.
>  # Small files: for cloud stores, archiving simply moves files from one directory to another, causing the archive folder to grow. We need a way to compact these files efficiently while remaining friendly to scans.
> Design:
>  The basic file-group abstraction for managing versions of data files can be extended to manage archived commit metadata. The idea is to use an optimal format (like HFile) to store a compacted version of <commitTime, Metadata> pairs. Every archiving run will read <commitTime, Metadata> pairs from the active timeline and append them to indexable log files. We will run periodic minor compactions to merge multiple log files into a compacted HFile that stores metadata for a time range. Note that we will also partition by action type (commit/clean). This design would allow the archived timeline to be queried to determine whether a timeline is valid.
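
To make the design above a little more concrete, here is a minimal in-memory sketch of the flow it describes: archiving runs append <commitTime, Metadata> pairs to log files, a minor compaction merges them into one sorted file covering a time range, and everything is partitioned by action type. This is not Hudi code. The class ArchivedTimelineSketch and its methods (archive, compact, lookup, isArchivedCommit) are hypothetical names chosen for illustration, and sorted in-memory maps stand in for the indexable log files and HFiles.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical in-memory model of the proposed archived timeline; not Hudi code. */
final class ArchivedTimelineSketch {

    /** One partition per action type, e.g. "commit" or "clean". */
    private static final class ActionPartition {
        // Stand-in for indexable log files: each archiving run appends one sorted batch.
        final List<TreeMap<String, String>> logFiles = new ArrayList<>();
        // Stand-in for compacted HFiles: each holds the metadata for one time range.
        final List<TreeMap<String, String>> compactedFiles = new ArrayList<>();
    }

    private final Map<String, ActionPartition> partitions = new HashMap<>();

    /** Archiving run: append <commitTime, metadata> pairs of one action type as a new log file. */
    void archive(String actionType, Map<String, String> instants) {
        if (instants.isEmpty()) {
            return; // nothing to archive, so do not create an empty file
        }
        partitions.computeIfAbsent(actionType, k -> new ActionPartition())
                  .logFiles.add(new TreeMap<>(instants));
    }

    /** Minor compaction: merge all pending log files of one action type into one sorted file. */
    void compact(String actionType) {
        ActionPartition p = partitions.get(actionType);
        if (p == null || p.logFiles.isEmpty()) {
            return;
        }
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> log : p.logFiles) {
            merged.putAll(log);
        }
        p.compactedFiles.add(merged);
        p.logFiles.clear();
    }

    /** Point lookup of archived metadata by action type and commit time. */
    Optional<String> lookup(String actionType, String commitTime) {
        ActionPartition p = partitions.get(actionType);
        if (p == null) {
            return Optional.empty();
        }
        for (TreeMap<String, String> file : p.compactedFiles) {
            // Skip compacted files whose time range cannot contain the key.
            if (commitTime.compareTo(file.firstKey()) < 0
                    || commitTime.compareTo(file.lastKey()) > 0) {
                continue;
            }
            String metadata = file.get(commitTime);
            if (metadata != null) {
                return Optional.of(metadata);
            }
        }
        for (TreeMap<String, String> log : p.logFiles) {
            String metadata = log.get(commitTime);
            if (metadata != null) {
                return Optional.of(metadata);
            }
        }
        return Optional.empty();
    }

    /** Validity check: a commit time counts as committed only if archived metadata exists for it. */
    boolean isArchivedCommit(String commitTime) {
        return lookup("commit", commitTime).isPresent();
    }
}
```

The point of the sketch is the last method: instead of assuming a file is valid because its commit time is older than the active timeline, a reader would look the commit time up in the archive, so a file left behind by a failed rollback is not treated as committed.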



--
This message was sent by Atlassian Jira
(v8.3.4#803005)