Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/24 05:56:18 UTC

[GitHub] [hudi] Gatsby-Lee opened a new issue #4896: [SUPPORT] Metadata Table causes missing data.

Gatsby-Lee opened a new issue #4896:
URL: https://github.com/apache/hudi/issues/4896


   **Describe the problem you faced**
   
   Regardless of the table type (CoW or MoR), I notice missing data when the Metadata Table is enabled.
   
   For example, if I ingest 100,000 records (no duplicates) with a batch size of 10,000, the number of ingested records in Hudi is not 100,000.
   
   I checked the number of records through Amazon Athena and double-checked the count by running a Spark job.
   
   **Full Configuration**
   
   ```
   {
   	'className': 'org.apache.hudi',
   	'hoodie.datasource.hive_sync.database': 'hudi_exp',
   	'hoodie.datasource.hive_sync.enable': 'true',
   	'hoodie.datasource.hive_sync.support_timestamp': 'true',
   	'hoodie.datasource.hive_sync.table': 'hudi_etl_exp',
   	'hoodie.datasource.hive_sync.use_jdbc': 'false',
   	'hoodie.datasource.write.hive_style_partitioning': 'true',
   	'hoodie.datasource.write.partitionpath.field': 'org_id',
   	'hoodie.datasource.write.recordkey.field': 'obj_id',
   	'hoodie.table.name': 'hudi_etl_exp',
   	'hoodie.bulkinsert.shuffle.parallelism': '24',
   	'hoodie.delete.shuffle.parallelism': '24',
   	'hoodie.insert.shuffle.parallelism': '24',
   	'hoodie.upsert.shuffle.parallelism': '24',
   	'hoodie.index.type': 'BLOOM',
   	'hoodie.bloom.index.prune.by.ranges': 'true',
   	'hoodie.datasource.clustering.async.enable': 'false',
   	'hoodie.datasource.clustering.inline.enable': 'false',
   	'hoodie.datasource.compaction.async.enable': 'false',
   	'hoodie.clean.automatic': 'true',
   	'hoodie.clean.async': 'true',
   	'hoodie.keep.max.commits': 40,
   	'hoodie.keep.min.commits': 30,
   	'hoodie.cleaner.commits.retained': 20,
   	'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
   	'hoodie.compact.inline': 'false',
   	'hoodie.clustering.async.enabled': 'false',
   	'hoodie.clustering.async.max.commits': 4,
   	'hoodie.clustering.inline': 'false',
   	'hoodie.metadata.clean.async': 'true',
   	'hoodie.cleaner.policy.failed.writes': 'LAZY',
   	'hoodie.write.concurrency.mode': 'OPTIMISTIC_CONCURRENCY_CONTROL',
   	'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider',
   	'hoodie.write.lock.zookeeper.port': '2181',
   	'hoodie.write.lock.zookeeper.url': 'zookeeper_url',
   	'hoodie.write.lock.zookeeper.base_path': 'zookeeper_base_path',
   	'hoodie.write.lock.zookeeper.lock_key': 'hudi_etl_exp',
   	'path': 's3://hello-hudi/hudi_exp/hudi_etl_exp',
   	'hoodie.datasource.write.precombine.field': '_etl_cluster_ts',
   	'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
   	'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
   	'hoodie.datasource.hive_sync.partition_fields': 'org_id',
   	'hoodie.combine.before.upsert': 'true',
   	'hoodie.datasource.write.operation': 'upsert',
   	'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
   	'hoodie.table.type': 'COPY_ON_WRITE',
   	'hoodie.metadata.enable': 'true'
   }
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Generate 100 random records
   2. Ingest 10 records per batch
   3. Count the ingested records after each batch (expecting 10, 20, 30, and so on)
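
The reproduction steps above can be sketched as a small batching harness. This is only an illustration, not the reporter's actual job: `write_batch` and `count_rows` are hypothetical callables standing in for the Spark upsert into the Hudi table and for the Athena or Spark count, and the helper names are made up for this sketch.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Record = Dict[str, object]


def batches(records: List[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive slices of `records`, each holding at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def ingest_and_check(
    records: List[Record],
    batch_size: int,
    write_batch: Callable[[List[Record]], None],
    count_rows: Callable[[], int],
) -> List[Tuple[int, int]]:
    """Write each batch, then record (expected_total, actual_total) after it.

    `write_batch` and `count_rows` are hypothetical hooks: in a real run they
    would wrap the Hudi upsert and a row count on the table.
    """
    results = []
    expected = 0
    for batch in batches(records, batch_size):
        write_batch(batch)
        expected += len(batch)
        results.append((expected, count_rows()))
    return results


if __name__ == "__main__":
    # In-memory stand-in for the table, keyed by record key like an upsert.
    table: Dict[str, Record] = {}

    def fake_write(batch: List[Record]) -> None:
        for record in batch:
            table[str(record["obj_id"])] = record

    data = [{"obj_id": f"id-{i}", "org_id": i % 3} for i in range(100)]
    report = ingest_and_check(data, 10, fake_write, lambda: len(table))
    # Any pair where expected != actual marks the batch after which rows went missing.
    print([pair for pair in report if pair[0] != pair[1]])
```

Recording the count after every batch, rather than only at the end, shows at which commit the totals first diverge, which can then be matched against the instants on the timeline.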
   
   
   **Expected behavior**
   
   All 100 records should be present in the Hudi table.
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * Spark version : 3.1.1-amzn-0
   
   * Hive version : 2.3.7-amzn-4
   
   * Hadoop version : 3.2.1-amzn-3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1063182737


   @yihua @nsivabalan 
   
   I haven't confirmed this yet, but I suspect the issue happens when the table is partitioned.





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1055017231


   @nsivabalan @yihua Thank you
   Please let me know if you need anything from me about this issue.





[GitHub] [hudi] nsivabalan edited a comment on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1050303381


   Can you avoid setting the config "'hoodie.metadata.clean.async': 'true'"? It should not have been exposed to end users, and it has to be false. Please remove it and try again.
   Let us know how it goes.





[GitHub] [hudi] nsivabalan commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1050303381


   Can you avoid setting the config "'hoodie.metadata.clean.async': 'true'"? It should not have been exposed to end users, and it has to be false. Please remove it and try again.
   





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1062476332


   @nsivabalan : Hi, OK, I will do that.
   BTW, I no longer have the data from my last report.
   Let me run the experiment again, regenerate the data, and share it with you.





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1051847785


   @nsivabalan 
   Hi, I ran the tests:
   
   Test1: no metadata table - OK
   Test2: metadata table + hoodie.metadata.clean.async=true - Missing Data
   Test3: metadata table + hoodie.metadata.clean.async=false - Missing Data
   
   I still see the missing data issue when Metadata table is enabled.
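
The three runs above differ only in the metadata-related options, so they can be expressed as overrides on a shared base config. The sketch below is illustrative: `BASE_OPTIONS` is an abbreviated copy of the config posted in the issue, the run names and `options_for` are hypothetical, and setting `hoodie.metadata.enable` to `'false'` explicitly for Test1 is an assumption about how "no metadata table" was configured.

```python
from typing import Dict

# Abbreviated base config (subset of the options posted in the issue).
BASE_OPTIONS: Dict[str, str] = {
    "hoodie.table.name": "hudi_etl_exp",
    "hoodie.datasource.write.recordkey.field": "obj_id",
    "hoodie.datasource.write.partitionpath.field": "org_id",
    "hoodie.datasource.write.precombine.field": "_etl_cluster_ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

# Only the options that change between the three runs.
TEST_MATRIX: Dict[str, Dict[str, str]] = {
    # Assumption: "no metadata table" means the flag is explicitly disabled.
    "test1_no_metadata": {"hoodie.metadata.enable": "false"},
    "test2_metadata_async_clean": {
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.clean.async": "true",
    },
    "test3_metadata_sync_clean": {
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.clean.async": "false",
    },
}


def options_for(test_name: str) -> Dict[str, str]:
    """Return a fresh options dict: the base config plus one run's overrides."""
    return {**BASE_OPTIONS, **TEST_MATRIX[test_name]}


if __name__ == "__main__":
    for name in TEST_MATRIX:
        opts = options_for(name)
        print(name, {k: v for k, v in opts.items() if k.startswith("hoodie.metadata")})
```

Keeping the overrides separate from the base config makes it easy to confirm that nothing else changed between a passing and a failing run.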
   





[GitHub] [hudi] nsivabalan commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1054638533


   @yihua : Can you assist in debugging this data loss issue with the metadata table? This is hudi-0.9.0, btw.
   





[GitHub] [hudi] nsivabalan closed issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4896:
URL: https://github.com/apache/hudi/issues/4896


   





[GitHub] [hudi] nsivabalan commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1052120522


   OK. With the metadata table enabled and metadata.clean.async set to false:
   
   can you post the contents of .hoodie 
   and .hoodie/metadata/.hoodie
   
   and we can go from there. When you run "ls", make sure you sort by file last-modified time, btw.
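
A listing sorted by last-modified time can also be produced with a few lines of Python. This is a minimal sketch that assumes the timeline folder is available on a local filesystem (for an S3 table, a local copy of `.hoodie`); `list_by_mtime` is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Tuple


def list_by_mtime(directory: str) -> List[Tuple[int, str, str]]:
    """Return (size_bytes, modified_time, name) for each entry, oldest first."""
    entries = []
    for path in Path(directory).iterdir():
        stat = path.stat()
        entries.append((stat.st_mtime, path.name, stat.st_size))
    entries.sort()  # by modification time, then by name for ties
    return [
        (size, datetime.fromtimestamp(mtime).strftime("%b %d %H:%M"), name)
        for mtime, name, size in entries
    ]


if __name__ == "__main__":
    # Assumption: ".hoodie" is a local copy of the table's timeline folder.
    if Path(".hoodie").is_dir():
        for size, modified, name in list_by_mtime(".hoodie"):
            print(f"{size:>8} {modified} {name}")
```

Sorting by modification time keeps the requested, inflight, and completed files of each instant in the order they were written, which makes interleaved actions such as rollbacks easier to spot.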
   
   





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1053329848


   @nsivabalan 
   
   Hi, 
   
   here is the content in .hoodie
   
   ```
       0 Feb 26 23:34 archived_$folder$
       0 Feb 26 23:34 .temp_$folder$
     514 Feb 26 23:34 hoodie.properties
       0 Feb 26 23:34 .aux_$folder$
       0 Feb 26 23:34 20220227073429.commit.requested
       0 Feb 26 23:34 .heartbeat_$folder$
   48321 Feb 26 23:34 20220227073429.inflight
   68948 Feb 26 23:35 20220227073429.commit
       0 Feb 26 23:35 metadata_$folder$
       0 Feb 26 23:35 20220227073548.commit.requested
   47671 Feb 26 23:36 20220227073548.inflight
   69025 Feb 26 23:36 20220227073548.commit
       0 Feb 26 23:37 20220227073706.commit.requested
   48322 Feb 26 23:37 20220227073706.inflight
   69957 Feb 26 23:37 20220227073706.commit
       0 Feb 26 23:40 20220227074055.commit.requested
       0 Feb 26 23:41 20220227074100.rollback.inflight
   15444 Feb 26 23:41 20220227074100.rollback
   50285 Feb 26 23:41 20220227074055.inflight
   72571 Feb 26 23:41 20220227074055.commit
       0 Feb 26 23:42 20220227074225.commit.requested
   48970 Feb 26 23:42 20220227074225.inflight
   70917 Feb 26 23:43 20220227074225.commit
       0 Feb 26 23:44 20220227074418.commit.requested
   47670 Feb 26 23:44 20220227074418.inflight
   69084 Feb 26 23:44 20220227074418.commit
       0 Feb 26 23:46 20220227074617.commit.requested
   47013 Feb 26 23:46 20220227074617.inflight
   68146 Feb 26 23:46 20220227074617.commit
       0 Feb 26 23:48 20220227074817.commit.requested
   49625 Feb 26 23:48 20220227074817.inflight
   71895 Feb 26 23:48 20220227074817.commit
       0 Feb 26 23:50 20220227075018.commit.requested
   45706 Feb 26 23:50 20220227075018.inflight
   66287 Feb 26 23:50 20220227075018.commit
       0 Feb 26 23:52 20220227075217.commit.requested
   48977 Feb 26 23:52 20220227075217.inflight
   70975 Feb 26 23:52 20220227075217.commit
       0 Feb 26 23:54 20220227075416.commit.requested
     256 Feb 26 23:54 ..
     128 Feb 26 23:54 .aux
    1376 Feb 26 23:54 .
     192 Feb 26 23:54 metadata
   ```
   
   here is the content of .hoodie/metadata/.hoodie
   ```
      0 Feb 26 23:35 archived_$folder$
    373 Feb 26 23:35 hoodie.properties
      0 Feb 26 23:35 .temp_$folder$
      0 Feb 26 23:35 .aux_$folder$
      0 Feb 26 23:35 20220227073429.deltacommit.requested
   1187 Feb 26 23:35 20220227073429.deltacommit.inflight
   2195 Feb 26 23:35 20220227073429.deltacommit
      0 Feb 26 23:36 20220227073548.deltacommit.requested
   1905 Feb 26 23:36 20220227073548.deltacommit.inflight
   2443 Feb 26 23:36 20220227073548.deltacommit
      0 Feb 26 23:37 20220227073706.deltacommit.requested
   1905 Feb 26 23:37 20220227073706.deltacommit.inflight
   2518 Feb 26 23:37 20220227073706.deltacommit
      0 Feb 26 23:41 20220227074055.deltacommit.requested
   1905 Feb 26 23:41 20220227074055.deltacommit.inflight
   2590 Feb 26 23:41 20220227074055.deltacommit
      0 Feb 26 23:41 20220227074100.deltacommit.requested
    526 Feb 26 23:41 20220227074100.deltacommit.inflight
   1297 Feb 26 23:41 20220227074100.deltacommit
      0 Feb 26 23:43 20220227074225.deltacommit.requested
   1905 Feb 26 23:43 20220227074225.deltacommit.inflight
   2667 Feb 26 23:43 20220227074225.deltacommit
      0 Feb 26 23:45 20220227074418.deltacommit.requested
   1905 Feb 26 23:45 20220227074418.deltacommit.inflight
   2742 Feb 26 23:45 20220227074418.deltacommit
      0 Feb 26 23:47 20220227074617.deltacommit.requested
   1905 Feb 26 23:47 20220227074617.deltacommit.inflight
   2817 Feb 26 23:47 20220227074617.deltacommit
      0 Feb 26 23:49 20220227074817.deltacommit.requested
   1905 Feb 26 23:49 20220227074817.deltacommit.inflight
   2893 Feb 26 23:49 20220227074817.deltacommit
      0 Feb 26 23:50 20220227075018.deltacommit.requested
   1905 Feb 26 23:50 20220227075018.deltacommit.inflight
   2967 Feb 26 23:50 20220227075018.deltacommit
      0 Feb 26 23:52 20220227075217.deltacommit.requested
   1905 Feb 26 23:52 20220227075217.deltacommit.inflight
   3042 Feb 26 23:53 20220227075217.deltacommit
    128 Feb 26 23:54 .aux
   1280 Feb 26 23:54 .
    192 Feb 26 23:54 ..
   ```
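
One way to read the two listings side by side is to compare the completed instants on the data table timeline with the completed deltacommits on the metadata table timeline. The sketch below is only a rough check based on the file-name pattern visible in the listings (`<instant>.<action>` for completed actions, with `.requested` or `.inflight` for pending ones); the function names are hypothetical and this is not how Hudi itself validates the timeline.

```python
import re
from typing import Dict, Iterable, Set

# Completed actions are named "<instant>.<action>" with no further suffix;
# names ending in ".requested" or ".inflight" do not match this pattern.
COMPLETED = re.compile(r"^(\d+)\.(commit|deltacommit|rollback)$")


def completed_instants(file_names: Iterable[str]) -> Dict[str, str]:
    """Map instant time -> action for every completed action in a listing."""
    found = {}
    for name in file_names:
        match = COMPLETED.match(name.strip())
        if match:
            found[match.group(1)] = match.group(2)
    return found


def missing_in_metadata(data_files: Iterable[str], metadata_files: Iterable[str]) -> Set[str]:
    """Instants completed on the data table with no completed instant of the
    same time on the metadata table."""
    return set(completed_instants(data_files)) - set(completed_instants(metadata_files))


if __name__ == "__main__":
    data = [
        "20220227074055.commit.requested",
        "20220227074100.rollback.inflight",
        "20220227074100.rollback",
        "20220227074055.inflight",
        "20220227074055.commit",
        "20220227075416.commit.requested",
    ]
    metadata = [
        "20220227074055.deltacommit",
        "20220227074100.deltacommit",
    ]
    print(sorted(missing_in_metadata(data, metadata)))
```

Applied to the file names in the two listings above, every completed commit and the rollback on the data table has a deltacommit with the same instant time on the metadata table, so this check comes back empty. That only shows the two timelines line up instant for instant; it says nothing about whether the file listings stored inside the metadata table are correct.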





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1062611610


   @nsivabalan I shared the requested content through the Slack channel.





[GitHub] [hudi] Gatsby-Lee commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
Gatsby-Lee commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1050305189


   @nsivabalan 
   Thank you. I will try with "hoodie.metadata.clean.async": false
   
   Yep. I will let you know the result.





[GitHub] [hudi] nsivabalan commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1073048786


   Hey @Gatsby-Lee : I will close out the GitHub issue for now. Once you have confirmed the issue and have sample data, feel free to open a new issue, and we can definitely follow up.
   Thanks!





[GitHub] [hudi] nsivabalan commented on issue #4896: [SUPPORT] Metadata Table causes missing data.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4896:
URL: https://github.com/apache/hudi/issues/4896#issuecomment-1060038796


   @Gatsby-Lee : Can you zip just the .hoodie folder contents and share them with us? I believe Ethan has some hudi-cli tools to inspect the timeline.
   So, the .hoodie of the data table and the .hoodie of the metadata table as well.
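
Packaging the two timeline folders could look roughly like the sketch below. It assumes the table has been copied to a local path, and it archives only the files directly inside `.hoodie` and `.hoodie/metadata/.hoodie`, skipping subfolders and the metadata table's data files; both that scope and the name `zip_timelines` are choices made for this illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_timelines(table_path: str, archive_path: str) -> List[str]:
    """Zip the timeline files of the data table and the metadata table.

    Returns the archive member names that were written.
    """
    root = Path(table_path)
    timeline_dirs = [
        root / ".hoodie",                          # data table timeline
        root / ".hoodie" / "metadata" / ".hoodie",  # metadata table timeline
    ]
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory in timeline_dirs:
            if not directory.is_dir():
                continue
            for entry in sorted(directory.iterdir()):
                if entry.is_file():
                    member = entry.relative_to(root).as_posix()
                    archive.write(entry, member)
                    added.append(member)
    return added


if __name__ == "__main__":
    # Assumption: "hudi_etl_exp" is a local copy of the table's base path.
    if Path("hudi_etl_exp").is_dir():
        print(zip_timelines("hudi_etl_exp", "hudi_etl_exp_timelines.zip"))
```

Member names are stored relative to the table base path, so the archive unpacks into the same `.hoodie` layout.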
   

