Posted to commits@hudi.apache.org by "codope (via GitHub)" <gi...@apache.org> on 2023/02/12 06:29:42 UTC

[GitHub] [hudi] codope commented on pull request #7923: [HUDI-5769] Fixing deletion of metadata partitions based on explicit configs set by user

codope commented on PR #7923:
URL: https://github.com/apache/hudi/pull/7923#issuecomment-1426954380

   I do not think this is a bug as we have called out this behavior in the async indexer docs - https://hudi.apache.org/docs/metadata_indexing#caveats
   That said, I think this issue is a little tricky and we need to address it holistically. The two main problems are:
   1. Special treatment of the `files` index. It is coupled to the metadata enable flag and created synchronously when the flag is enabled.
   2. Different sources of truth for which indexes are enabled (table config and writer config).
   
   I would like to borrow the best practice from [database systems](https://www.postgresql.org/docs/current/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY) (and our approach is partially in line with that). 
   Typically, they maintain what indexes are valid/invalid (completed/inflight) in system catalogs (table configs).
   In Hudi, we should move towards table config as the single source of truth for what metadata indexes are valid/invalid.
   - If a user has enabled metadata, then table config tells the system which indexes are ready for updates and reads. Of course, when metadata is enabled for the first time, the `files` index will be created synchronously and only this index will be present in table config. But we should think about building the `files` index asynchronously as well.
   - If a user has disabled metadata, then no index is used irrespective of the table config.
   - If a user has enabled metadata through the writer, and another index, like column stats, through the indexer, then table config serves as the source of truth for all subsequent writes (even in the multi-writer scenario).
   - If a user wants to disable one of the indexes, we should provide an API to update table configs atomically. Once updated, table configs will continue to serve as the source of truth. Disabling any index should be an offline operation (much like dropping an index).
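   
   To make the rules above concrete, here is a rough sketch of how a writer could resolve the set of usable indexes. This is only an illustration of the proposal, not existing Hudi code: the class `IndexResolution`, the `Index` enum, and the method `effectiveIndexes` are all hypothetical names, and the table config is modeled as a plain set of completed indexes.

   ```java
   import java.util.Collections;
   import java.util.EnumSet;
   import java.util.Set;

   // Hypothetical sketch; none of these names are part of Hudi's API.
   class IndexResolution {

       // Illustrative index kinds, assumed for this example only.
       enum Index { FILES, COLUMN_STATS, BLOOM_FILTERS }

       /**
        * Resolves which indexes a writer should update and readers may use.
        *
        * @param metadataEnabled        the writer's metadata enable flag
        * @param completedInTableConfig indexes recorded as completed in table config
        */
       static Set<Index> effectiveIndexes(boolean metadataEnabled,
                                          Set<Index> completedInTableConfig) {
           // Metadata disabled: no index is used, irrespective of table config.
           if (!metadataEnabled) {
               return Collections.emptySet();
           }
           // Metadata enabled: table config is the single source of truth.
           // Writer-level per-index flags are deliberately not consulted here.
           EnumSet<Index> result = EnumSet.noneOf(Index.class);
           result.addAll(completedInTableConfig);
           return result;
       }
   }
   ```

   With this shape, a writer that only sets the metadata flag still picks up a column stats index built by the indexer, because `effectiveIndexes(true, EnumSet.of(Index.FILES, Index.COLUMN_STATS))` returns both. Dropping an index would then be a separate, atomic update to table config rather than something inferred from writer configs.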
   
   Need to think through other cases. But, I think we can hold this patch from landing as the issue is not a regression. What do you think? @nsivabalan @xushiyan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org