Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/10/29 10:59:34 UTC

[GitHub] [hudi] stym06 opened a new issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

stym06 opened a new issue #3890:
URL: https://github.com/apache/hudi/issues/3890


   **Describe the problem you faced**
   
   Hudi HiveSyncTool did not add some partitions to Hive; it added only the newest partition.
   
   **Expected behavior**
   
   It should have added all the partitions that were missing from Hive.
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 2.4.4
   
   * Hive version : 3.2.1
   
   * Hadoop version : 3.1.1
   
   * Storage (HDFS/S3/GCS..) : Azure 
   
   * Running on Docker? (yes/no) : K8s
   
   
   **Additional context**
   
   I run this sync job every day using cron. For some reason it did not run on the 27th and 28th. When I reran the job on the 29th, it added only the partition for the 29th and left out the partitions for the 27th and 28th. Earlier partitions had already been created.
   
   **Stacktrace**
   
   * Data and Partitions after day-29 run
   ```
   Found 12 items
   drwxr-xr-x   - root supergroup          0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/.hoodie
   drwxr-xr-x   - root supergroup          0 2021-10-22 10:11 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-19
   drwxr-xr-x   - root supergroup          0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-20
   drwxr-xr-x   - root supergroup          0 2021-10-22 10:10 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-21
   drwxr-xr-x   - root supergroup          0 2021-10-22 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-22
   drwxr-xr-x   - root supergroup          0 2021-10-23 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-23
   drwxr-xr-x   - root supergroup          0 2021-10-24 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-24
   drwxr-xr-x   - root supergroup          0 2021-10-25 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-25
   drwxr-xr-x   - root supergroup          0 2021-10-26 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-26
   drwxr-xr-x   - root supergroup          0 2021-10-27 18:40 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-27
   drwxr-xr-x   - root supergroup          0 2021-10-29 08:21 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-28
   drwxr-xr-x   - root supergroup          0 2021-10-29 10:48 wasb://testblob-v1@testblob.blob.core.windows.net/data/pipelines/hudi/kafka/telemetrics/dp.hmi.quectel.test.data.packet.v2/dt=2021-10-29
   
   hive> show partitions dp_hmi_quectel_test_data_packet_v2;
   OK
   dt=2021-10-19
   dt=2021-10-20
   dt=2021-10-21
   dt=2021-10-22
   dt=2021-10-23
   dt=2021-10-24
   dt=2021-10-25
   dt=2021-10-26
   dt=2021-10-29
   ```
   
   * Day-29 Job logs
   
   ```
   2021-10-29 10:46:02,513 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(190)) - Last commit time synced was found to be 20211026005933
   2021-10-29 10:46:02,513 INFO  [main] common.AbstractSyncHoodieClient (AbstractSyncHoodieClient.java:getPartitionsWrittenToSince(162)) - Last commit time synced is 20211026005933, Getting commits since then
   2021-10-29 10:46:03,070 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncHoodieTable(192)) - Storage partitions scan complete. Found 1
   2021-10-29 10:46:03,070 INFO  [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(895)) - 0: get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
   2021-10-29 10:46:03,071 INFO  [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(347)) - ugi=root	ip=unknown-ip-addr	cmd=get_partitions : tbl=hive.default.dp_hmi_quectel_imu_data_packet_v2
   2021-10-29 10:46:03,104 INFO  [main] hive.HiveSyncTool (HiveSyncTool.java:syncPartitions(333)) - New Partitions [dt=2021-10-29]
   2021-10-29 10:46:03,104 INFO  [main] ddl.HMSDDLExecutor (HMSDDLExecutor.java:addPartitionsToTable(181)) - Adding partitions 1 to table dp_hmi_quectel_imu_data_packet_v2
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan closed issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #3890:
URL: https://github.com/apache/hudi/issues/3890


   





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-997541013


   Yeah, I inspected the code and it looks like there could be a gap.
   That is, if you create a new partition every day and older partitions are never updated once their date has passed, and you fail to sync daily, and archival is aggressive enough to trim commits for partitions that were never synced, then our Hive sync tool can miss those partitions.
   
   From AbstractSyncHoodieClient: 
   ```
         LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", Getting commits since then");
         return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
             .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
       }
   ```
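   So here we take the commits after the last synced instant in the active timeline, read their commit metadata, and derive the partitions to sync from it. Commits that have already been archived are not in the active timeline, so the partitions they wrote are not considered.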
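
To make the gap concrete, here is a minimal, self-contained sketch. It is not Hudi code: the `Commit` class and `partitionsWrittenSince` are hypothetical stand-ins for commit metadata and the timeline lookup, and every instant time except the last synced one (taken from the job log) is invented for illustration. It assumes archival has already removed the day-27 and day-28 commits from the active timeline.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SyncGapDemo {

  /** Hypothetical stand-in for a completed commit: its instant time and the partitions it wrote. */
  static final class Commit {
    final String instantTime;
    final List<String> partitions;

    Commit(String instantTime, String... partitions) {
      this.instantTime = instantTime;
      this.partitions = Arrays.asList(partitions);
    }
  }

  /** Partitions written by commits strictly after lastSynced, looking only at the given timeline. */
  static Set<String> partitionsWrittenSince(List<Commit> timeline, String lastSynced) {
    Set<String> result = new TreeSet<>();
    for (Commit c : timeline) {
      // Instant times are fixed-width yyyyMMddHHmmss strings, so string order matches time order.
      if (c.instantTime.compareTo(lastSynced) > 0) {
        result.addAll(c.partitions);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String lastSynced = "20211026005933"; // from the day-29 job log

    // Invented instant times, one commit per daily partition.
    List<Commit> fullHistory = Arrays.asList(
        new Commit("20211027184000", "dt=2021-10-27"),
        new Commit("20211029082100", "dt=2021-10-28"),
        new Commit("20211029104800", "dt=2021-10-29"));

    // Assumption: archival has moved the first two commits out of the active timeline.
    List<Commit> activeTimeline = fullHistory.subList(2, 3);

    System.out.println("Full history:    " + partitionsWrittenSince(fullHistory, lastSynced));
    System.out.println("Active timeline: " + partitionsWrittenSince(activeTimeline, lastSynced));
  }
}
```

Under these assumptions the full history yields all three partitions, while the active timeline yields only `dt=2021-10-29`, which matches the single new partition reported in the job log.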
   
   





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-965667234


   @stym06 : Can you post the contents of .hoodie? From that we can deduce whether archival is the issue here.





[GitHub] [hudi] stym06 commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
stym06 commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-997343328


   @nsivabalan I don't have the .hoodie folder right now, but here is what I saw: when Hive sync ran with the last commit time synced on the 26th, it searched the commit files after the 26th and collected the partitions written by those commits. However, the commits from the 27th and 28th were no longer in the .hoodie folder, while the commit file for the 29th was. And the code is written to get all commits after the last sync time and find the partitions to add from them.
   
   As a workaround, I had to change the code to list the wasb folder structure and add the missing partitions, which seems to work. The commits from the 27th and 28th had most probably been archived.
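
The idea behind that workaround can be sketched as a set difference between the partition directories on storage and the partitions Hive already knows. This is a minimal illustration, not the actual patch: `MissingPartitionFinder`, `partitionOf`, and `missingInHive` are hypothetical names, the `/base/table` prefix is a placeholder for the real wasb path, and the listing is passed in as plain strings rather than read from the file system. It also assumes a single `dt=` partition level, as in this table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MissingPartitionFinder {

  /** Returns the trailing "dt=..." directory name, or null for non-partition paths such as .hoodie. */
  static String partitionOf(String path) {
    String name = path.substring(path.lastIndexOf('/') + 1);
    return name.startsWith("dt=") ? name : null;
  }

  /** Partitions that exist as directories on storage but are not registered in Hive. */
  static List<String> missingInHive(List<String> storagePaths, List<String> hivePartitions) {
    Set<String> known = new HashSet<>(hivePartitions);
    List<String> missing = new ArrayList<>();
    for (String path : storagePaths) {
      String partition = partitionOf(path);
      if (partition != null && !known.contains(partition)) {
        missing.add(partition);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Abbreviated from the storage listing in the issue; "/base/table" is a placeholder prefix.
    List<String> storagePaths = Arrays.asList(
        "/base/table/.hoodie",
        "/base/table/dt=2021-10-26",
        "/base/table/dt=2021-10-27",
        "/base/table/dt=2021-10-28",
        "/base/table/dt=2021-10-29");
    // Abbreviated from the "show partitions" output in the issue.
    List<String> hivePartitions = Arrays.asList("dt=2021-10-26", "dt=2021-10-29");

    System.out.println(missingInHive(storagePaths, hivePartitions));
  }
}
```

With the abbreviated inputs above, the result is the two partitions that the day-29 sync skipped, `dt=2021-10-27` and `dt=2021-10-28`. The cost is a full directory listing on every run, which grows with the number of partitions.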
   





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-998448144


   Closing the GitHub issue. We will work on fixing it.
   





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-956790581


   Can you take this up @codope please.





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-997548317


   https://issues.apache.org/jira/browse/HUDI-3068
   





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-997134256


   @stym06 : Can you respond to my questions above? I would like to get to the bottom of this. Hive sync in general keeps track of the last synced time, so I am not sure how this could happen. If you were able to resolve the issue, feel free to close it out.





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-991847640


   @stym06 : Please respond when you can.





[GitHub] [hudi] nsivabalan commented on issue #3890: [SUPPORT] Hudi Sync did not add previous partitions

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3890:
URL: https://github.com/apache/hudi/issues/3890#issuecomment-997546283


   Is there any particular reason why you run Hive sync separately, and with large gaps relative to regular commits? We can support this use case, but then every sync would fetch all partitions, which might not be ideal for regular users.
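
One possible middle ground, sketched here purely as an illustration and not as the change Hudi made, is to fall back to a full partition listing only when the last synced instant is older than the earliest instant still in the active timeline. `SyncModeChooser`, `Mode`, and `choose` are hypothetical names, and the sketch assumes archival always removes the oldest commits first.

```java
public class SyncModeChooser {

  enum Mode { INCREMENTAL, FULL_LISTING }

  /**
   * Picks how to find partitions to sync.
   *
   * @param lastSyncedInstant     instant time of the last successful sync, or null if never synced
   * @param earliestActiveInstant oldest instant still in the active timeline, or null if it is empty
   */
  static Mode choose(String lastSyncedInstant, String earliestActiveInstant) {
    if (lastSyncedInstant == null || earliestActiveInstant == null) {
      // Nothing to compare against, so take the conservative path.
      return Mode.FULL_LISTING;
    }
    // Fixed-width yyyyMMddHHmmss strings compare in time order.
    // If the last sync predates the active timeline, commits in between may have been archived.
    return lastSyncedInstant.compareTo(earliestActiveInstant) < 0
        ? Mode.FULL_LISTING
        : Mode.INCREMENTAL;
  }

  public static void main(String[] args) {
    // Last synced instant from the job log; the active-timeline instant is invented for illustration.
    System.out.println(choose("20211026005933", "20211029104800"));
    // Regular case: the last synced commit is still in the active timeline.
    System.out.println(choose("20211029104800", "20211029104800"));
  }
}
```

In this sketch, regular users who sync with every commit stay on the incremental path, and only a sync that has fallen behind archival pays for the full listing.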
   

