Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/19 07:37:57 UTC

[GitHub] [hudi] voonhous opened a new issue, #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

voonhous opened a new issue, #6716:
URL: https://github.com/apache/hudi/issues/6716

   Hello Hudi, this is a question about the design considerations behind the coupling between the metadata table (MDT) and the archival of commits on a data table (DT).
   
   Before commits on the DT can be archived, at least one compaction must already have been performed on the MDT:
   
   ```java
       // If metadata table is enabled, do not archive instants which are more recent than the last compaction on the
       // metadata table.
       if (config.isMetadataTableEnabled()) {
         try (HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(table.getContext(), config.getMetadataConfig(),
             config.getBasePath(), FileSystemViewStorageConfig.SPILLABLE_DIR.defaultValue())) {
           Option<String> latestCompactionTime = tableMetadata.getLatestCompactionTime();
           if (!latestCompactionTime.isPresent()) {
             LOG.info("Not archiving as there is no compaction yet on the metadata table");
             instants = Stream.empty();
           } else {
             LOG.info("Limiting archiving of instants to latest compaction on metadata table at " + latestCompactionTime.get());
             instants = instants.filter(instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.LESSER_THAN,
                 latestCompactionTime.get()));
           }
         } catch (Exception e) {
           throw new HoodieException("Error limiting instant archival based on metadata table", e);
         }
       }
   ```
   
   Assume that a DT has the MDT enabled (the default for Spark entry points) and that ONLY **INSERT-OVERWRITE** actions are performed on it (treated here as a table service action, generating `replacecommit`s). In that case, commits are never archived.
   
   This is because compaction on the MDT is never triggered when the action performed on the DT is a table service action.
   
   As such, the archival service on the DT depends on the MDT's compaction service, which in turn depends on the DT's data manipulation operations.
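   
   To make the dependency concrete, here is a minimal, self-contained sketch of the gate in the snippet above. It is not Hudi code: the class `ArchivalGateSketch` and the method `archivable` are hypothetical names, and it assumes instant timestamps are fixed-width strings, so that plain string comparison orders them by time.
   
   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;
   import java.util.Optional;
   import java.util.stream.Collectors;
   
   // Hypothetical, simplified model of the archival gate; these are not Hudi's classes.
   final class ArchivalGateSketch {
   
     // Returns the DT instant timestamps that remain eligible for archival.
     // Assumption: timestamps are fixed-width strings, so lexicographic order is time order.
     static List<String> archivable(List<String> dtInstants, Optional<String> latestMdtCompaction) {
       if (!latestMdtCompaction.isPresent()) {
         // No compaction on the MDT yet: nothing is eligible, however old the instants are.
         return Collections.emptyList();
       }
       String limit = latestMdtCompaction.get();
       // Keep only instants strictly older than the latest MDT compaction.
       return dtInstants.stream()
           .filter(ts -> ts.compareTo(limit) < 0)
           .collect(Collectors.toList());
     }
   
     public static void main(String[] args) {
       List<String> instants = Arrays.asList("20220919070000", "20220919071000", "20220919072000");
       // MDT never compacted: prints []
       System.out.println(archivable(instants, Optional.empty()));
       // MDT compacted at the second instant: prints [20220919070000]
       System.out.println(archivable(instants, Optional.of("20220919071000")));
     }
   }
   ```
   
   With an empty `Optional`, the result is always empty regardless of how many instants accumulate, which is the situation described above.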
   
   TL;DR: I am unsure what design considerations led to this restriction, so I am asking the community why it is in place.
   
   Thank you.
   
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : 3.1
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1287927057

   There is probably some misunderstanding. Any action on the data table is applied to the metadata table, be it commit, delta commit, clustering, insert_overwrite, delete_partition, clean, rollback, etc. So there should not be any issue.
   




[GitHub] [hudi] voonhous commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
voonhous commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1254590318

   @yihua Thank you for the reply. 
   
   > Is the INSERT_OVERWRITE the only write action
   
   Yes, INSERT_OVERWRITE is the only action being performed on the table, i.e. an insert always rewrites a given partition, regardless of whether the partition already exists.
   
   For simplicity, the reproduction below uses a non-partitioned table.
   
   ```sql
   drop table dev_data_infra.insert_overwrite_archive_test purge;
   create table if not exists insert_overwrite_archive_test(
   	id int,
   	name string,
   	price double,
   	_ts long
   ) using hudi 
   tblproperties (
   	type = 'cow',
   	primaryKey = 'id',
   	preCombineField = '_ts'
   ) location 'hdfs://insert_overwrite_archive_test';
   
   -- INSERT_OVERWRITE 64 times
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   INSERT OVERWRITE insert_overwrite_archive_test VALUES (1, "test", 0.00, 1);
   ```
   
   After the INSERT_OVERWRITE operations have completed, we can count the completed `replacecommit` files in the timeline directory on HDFS:
   
   ```shell
   $ hdfs dfs -ls hdfs://insert_overwrite_archive_test/.hoodie | grep -c '\.replacecommit$'
   64
   ```
   
   Let us also list the contents of the archive folder:
   ```shell
   $ hdfs dfs -ls hdfs://insert_overwrite_archive_test/.hoodie/archived
   
   ```
   
   The above command returns nothing, as no archiving has been done.
   




[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1287945284

   https://github.com/apache/hudi/pull/7037




[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1287940207

   Actually, you are right. We have a bug around this:
   https://issues.apache.org/jira/browse/HUDI-5078
   Will put up a fix shortly.
   Thanks for bringing it up.
   




[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1251679200

   @yihua: Can you take this up, please?




[GitHub] [hudi] voonhous commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
voonhous commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1255993000

   @yihua I still don't quite understand this part:
   
   > This ensures that all base files in the metadata table are always in sync with the data table (w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table.
   
   Can you please elaborate? Thank you.




[GitHub] [hudi] nsivabalan closed issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table
URL: https://github.com/apache/hudi/issues/6716




[GitHub] [hudi] voonhous commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

Posted by GitBox <gi...@apache.org>.
voonhous commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1250675675

   @TengHuo @fengjian428 @hbgstc123

