Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/13 22:28:06 UTC

[GitHub] [hudi] asheeshgarg opened a new issue #1825: [SUPPORT] Compaction of parquet and meta file

asheeshgarg opened a new issue #1825:
URL: https://github.com/apache/hudi/issues/1825


   Setup org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
   Client PySpark
   Storage S3:
   
   I have a number of datasets arriving at different times of the day, say 500 datasets each day. Each dataset is small and mostly independent (roughly 5000 rows), but all share the same column structure. I have partitioned the data using a date column.
   My goal is inline compaction: each write should compact the data so that by the end of the day there is a single parquet file, with the older parquet files deleted.
   Following are the Hudi options I have used with PySpark:
   hudi_options = {
       "hoodie.table.name": self.table_name,
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
       "hoodie.datasource.write.precombine.field": "column",
       "hoodie.datasource.write.table.name": self.table_name,
       "hoodie.compact.inline": True,
       "hoodie.compact.inline.max.delta.commits": 1,
       "hoodie.upsert.shuffle.parallelism": 2,
       "hoodie.insert.shuffle.parallelism": 2,
       "hoodie.embed.timeline.server": False,
       "hoodie.datasource.write.partitionpath.field": "snapshot_date",
   }
   
   Writes succeed, but I see multiple parquet files under the S3 location for a given date. Do I need to add any property to the Spark Hudi options to compact the meta and parquet files into one file?
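For context, a minimal sketch of how options like the above are typically passed to a PySpark writer; the table name "my_table" and the S3 path are placeholders, not values from the original script:

```python
# Hypothetical Hudi writer options, mirroring the configuration in the question.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "snapshot_date,dataset,column",
    "hoodie.datasource.write.precombine.field": "column",
    "hoodie.datasource.write.table.name": "my_table",
    "hoodie.compact.inline": True,
    "hoodie.compact.inline.max.delta.commits": 1,
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "hoodie.embed.timeline.server": False,
    "hoodie.datasource.write.partitionpath.field": "snapshot_date",
}

# The write itself would then be (requires a SparkSession with the
# hudi-spark-bundle on the classpath, so it is shown here as a comment):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/table")
```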
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-665040109


   @bvaradar so even if I change the partitioning so that each dataset writes to a different partition per day, and only one write happens per partition, will this still be an issue in 0.5.3?





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659804284


   @asheeshgarg :
   Just to keep the terminology consistent: the 2 physical files you listed are 2 different versions of the same file, so queries will see only one parquet file. This is fine and expected if we are ingesting only a few records per batch, which can all fit in a single file.
   
   Some questions: are you ingesting each source dataset as a separate batch into Hudi? If so, why do I only see 3 commits? Can you paste the contents of the 3 commit files (20200716171413.commit, 20200716170252.commit and 20200716154733.commit) and let me know how many records you were expecting to ingest?
   





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-665120755


   @asheeshgarg : Yes. Currently, concurrent writers can interfere with one another as part of the automatic rollback process. We are revamping this in 0.6, which will allow parallel writing across partitions.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658422907


   @bvaradar you are right, we are looking for clustering. Do you have a timeline in mind for when this will be available, or any branch to look at?





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662530755


   @bvaradar the contents of .hoodie are listed at https://gist.github.com/asheeshgarg/8897de60ab6ba78b5847f5432a4a69dd
   





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662543343


   @asheeshgarg : I can only see rollback files here. These should be cleaned up once HUDI-1118 is added. BTW, this actually indicates that you are seeing (or had seen) lots of failures. This is not normal, and you should look into what the failures are when you ingest the data.
   
   





[GitHub] [hudi] bvaradar closed issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1825:
URL: https://github.com/apache/hudi/issues/1825


   





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658760384


   @asheeshgarg Meanwhile, you can set up the configs as suggested in https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658881320


   @bvaradar I ran with the above understanding and set the small-file size limit to 500 MB to match the 500 datasets, but after the write I see no change in the behavior; it is still creating separate 1 MB parquet files. Please advise.





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658757612


   @asheeshgarg : Clustering is planned for the 0.7 release. We are currently working on getting the 0.6 release out at the end of this month.





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-663112473


   @asheeshgarg : Yes, Hudi only supports a single writer. This means you need to run only one ingestion job at a time. Hudi takes care of running asynchronous background jobs such as cleaning, archiving and compaction. Note that Hudi currently mandates a single writer in order to provide row-level incremental changelogs.
   With the next release, 0.6.0, Hudi will allow concurrent ingestion safely, as long as users can guarantee that each concurrent ingestion job is writing to different physical partitions of the dataset.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659558896


   @bvaradar Balaji, I set hoodie.cleaner.commits.retained: 1, and after that I see only two parquet files in the filesystem. But when I load the partition using Spark, I don't see all the data. For example, say I have loaded 5 unique datasets; I only see 3 of them.
   Any suggestions?





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659052418


   @asheeshgarg : The 2 parquet files you have listed are essentially different versions of the same file (file_id: 65254296-10d0-49d4-b168-6708e6274712-0_0). Each time you write, only one of these would have been created. Just to avoid confusion, can you elaborate on what you expected to see?
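The version relationship is visible in the base-file naming convention, which, as I understand Hudi's layout for this version, is `<fileId>_<writeToken>_<instantTime>.parquet`. A small sketch parsing the two file names quoted later in this thread:

```python
def parse_hudi_base_file(name):
    # Hudi base files are named <fileId>_<writeToken>_<instantTime>.parquet;
    # two names sharing a fileId are versions of the same logical file.
    stem = name[: -len(".parquet")]
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time

a = parse_hudi_base_file("65254296-10d0-49d4-b168-6708e6274712-0_0-30-247_20200716170252.parquet")
b = parse_hudi_base_file("65254296-10d0-49d4-b168-6708e6274712-0_0-30-252_20200716171413.parquet")
assert a[0] == b[0]  # same file_id: the same logical file
assert a[2] != b[2]  # different commit instants: two versions of it
```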





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661195131


   @bvaradar thanks Balaji for your continued support; I will test this.





[GitHub] [hudi] asheeshgarg edited a comment on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg edited a comment on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659558896


   @bvaradar Balaji, I set hoodie.cleaner.commits.retained: 1, and after that I see only two parquet files in the filesystem. But when I load the partition using Spark, I don't see all the data. For example, say I have loaded 5 unique datasets; I only see 3 of them.
   Any suggestions?
   Parquet files:
   2020-07-16 17:03:00  494.8 KiB 65254296-10d0-49d4-b168-6708e6274712-0_0-30-247_20200716170252.parquet
   2020-07-16 17:14:20  494.9 KiB 65254296-10d0-49d4-b168-6708e6274712-0_0-30-252_20200716171413.parquet
   
   These are the distinct commits I see in Spark:
   +-------------------+
   |_hoodie_commit_time|
   +-------------------+
   |     20200716171413|
   |     20200715202927|
   |     20200716154733|
   +-------------------+
   
   These are the contents of .hoodie:
   temptest/hudi/.hoodie/.aux',
    'temptest/hudi/.hoodie/.temp',
    'temptest/hudi/.hoodie/20200715202927.commit',
    'temptest/hudi/.hoodie/20200715202927.commit.requested',
    'temptest/hudi/.hoodie/20200715202927.inflight',
    'temptest/hudi/.hoodie/20200715204132.commit',
    'temptest/hudi/.hoodie/20200715204132.commit.requested',
    'temptest/hudi/.hoodie/20200715204132.inflight',
    'temptest/hudi/.hoodie/20200716154733.clean',
    'temptest/hudi/.hoodie/20200716154733.clean.inflight',
    'temptest/hudi/.hoodie/20200716154733.clean.requested',
    'temptest/hudi/.hoodie/20200716154733.commit',
    'temptest/hudi/.hoodie/20200716154733.commit.requested',
    'temptest/hudi/.hoodie/20200716154733.inflight',
    'temptest/hudi/.hoodie/20200716162313.clean',
    'temptest/hudi/.hoodie/20200716162313.clean.inflight',
    'temptest/hudi/.hoodie/20200716162313.clean.requested',
    'temptest/hudi/.hoodie/20200716162313.commit',
    'temptest/hudi/.hoodie/20200716162313.commit.requested',
    'temptest/hudi/.hoodie/20200716162313.inflight',
    'temptest/hudi/.hoodie/20200716163952.commit',
    'temptest/hudi/.hoodie/20200716163952.commit.requested',
    'temptest/hudi/.hoodie/20200716163952.inflight',
    'temptest/hudi/.hoodie/20200716170252.clean',
    'temptest/hudi/.hoodie/20200716170252.clean.inflight',
    'temptest/hudi/.hoodie/20200716170252.clean.requested',
    'temptest/hudi/.hoodie/20200716170252.commit',
    'temptest/hudi/.hoodie/20200716170252.commit.requested',
    'temptest/hudi/.hoodie/20200716170252.inflight',
    'temptest/hudi/.hoodie/20200716171413.clean',
    'temptest/hudi/.hoodie/20200716171413.clean.inflight',
    'temptest/hudi/.hoodie/20200716171413.clean.requested',
    'temptest/hudi/.hoodie/20200716171413.commit',
    'temptest/hudi/.hoodie/20200716171413.commit.requested',
    'temptest/hudi/.hoodie/20200716171413.inflight',





[GitHub] [hudi] tooptoop4 commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661955977


   I'm facing the same entries under .hoodie.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662709245


   @bvaradar mostly I see:
   org.apache.hudi.exception.HoodieRollbackException: Found in-flight commits after time :20200722052838, please rollback greater commits first
   
   Does this imply that a transaction has started and another commit has happened in between? We receive data in batches, and at times there will be multiple writes active for a large batch. Please advise.
    
   
   





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658837587


   @bvaradar Thanks for the quick response, Balaji. To make sure I understand correctly, let me quickly run through an example.
   The data generated for each dataset will be in the range of 1 MB, for each of the 500 datasets. I had set the following properties:
   "hoodie.parquet.small.file.limit": 2*1024*1024,
   "hoodie.parquet.max.file.size": 2*1024*1024*1024,
   So, to understand correctly: on the first write, the 1 MB of data is below the 2 MB small-file limit, so the first parquet file written will be 1 MB. The second write of another 1 MB should merge into the existing parquet file. On the third write the data will again be 1 MB, but the first file has already reached the 2 MB limit, so a second parquet file will be created?
   Where does max.file.size come into this process?
   
   Also, will this happen automatically, or do I need to specify other properties for it to take effect, apart from the two I have specified?
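If I have understood Hudi's file-sizing behavior correctly, the routing decision can be sketched as follows. This is a deliberate simplification: the real writer also considers record-size estimates and insert parallelism.

```python
# Limits taken from the properties quoted above.
SMALL_FILE_LIMIT = 2 * 1024 * 1024        # 2 MiB: hoodie.parquet.small.file.limit
MAX_FILE_SIZE = 2 * 1024 * 1024 * 1024    # 2 GiB: hoodie.parquet.max.file.size

def routes_inserts_to_existing_file(current_size_bytes):
    # A base file below the small-file limit is a candidate to receive new
    # inserts; once it crosses the limit, new inserts go to a fresh file.
    # max.file.size caps how large any single written file may grow.
    return current_size_bytes < SMALL_FILE_LIMIT

assert routes_inserts_to_existing_file(1 * 1024 * 1024)      # 1 MiB file still grows
assert not routes_inserts_to_existing_file(2 * 1024 * 1024)  # at the limit: new file
```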
   
   Also, if I want to contribute to the development of the clustering feature, what would be the process for that?
   
   





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662628377


   @asheeshgarg : Yes, you should see that the Spark job failed, and its logs should tell you what went wrong.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659064590


   @bvaradar I was assuming that every time we write, the content would be merged into the existing file based on the size limits we have specified. Otherwise we will end up with lots of small files. As you can see, the files are roughly 0.5 MB each, so when we write the full datasets I will have 500 of them, which is exactly what I want to avoid.
    





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658896451


   @asheeshgarg : This sounds like a tuning problem. Please see https://github.com/apache/hudi/issues/1583#issuecomment-622894674 and https://github.com/apache/hudi/issues/654#issuecomment-489742356
   
   Thanks for your interest in contributing to clustering. You can start by going through the RFC page and looking at the JIRAs associated with it. Here is the link: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance
   





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659001436


   @bvaradar Balaji, I tried the mentioned properties but don't see any impact; I still see multiple parquet files being generated:
   
   2020-07-15 20:41:40  478.6 KiB 65254296-10d0-49d4-b168-6708e6274712-0_0-30-724_20200715204132.parquet
   2020-07-15 20:29:35  456.3 KiB 65254296-10d0-49d4-b168-6708e6274712-0_0-30-819_20200715202927.parquet
   
   Here are my Hudi writer configs:
   
   hudi_options = {
       "hoodie.table.name": self.table_name,
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.recordkey.field": "dl_snapshot_date,dl_dataset,column",
       "hoodie.datasource.write.precombine.field": "column",
       "hoodie.datasource.write.table.name": self.table_name,
       "hoodie.copyonwrite.record.size.estimate": 128,
       "hoodie.parquet.small.file.limit": 500*1024*1024,
       "hoodie.parquet.max.file.size": 2*1024*1024*1024,
       "hoodie.upsert.shuffle.parallelism": 2,
       "hoodie.insert.shuffle.parallelism": 2,
       "hoodie.embed.timeline.server": False,
       "hoodie.datasource.write.partitionpath.field": "dl_snapshot_date",
   }
   
   Could you please suggest if anything is wrong?
   








[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-660415951


   @asheeshgarg :
   1. We retain the latest 2 commits to ensure that concurrent read queries don't fail intermittently. They will be cleaned up with subsequent writes.
   2. Archiving commit metadata is the job of the archival process. Please look at https://hudi.apache.org/docs/configurations.html#archiveCommitsWith for the relevant configurations.
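For reference, cleaner retention and the archival window are separate knobs. A hypothetical combination is sketched below; the values are illustrative only, and, if I recall Hudi's constraint correctly, hoodie.keep.min.commits must exceed hoodie.cleaner.commits.retained:

```python
# Illustrative timeline-management settings (not taken from the thread's configs).
timeline_options = {
    # Cleaner: how many commits' worth of older file versions to retain on storage.
    "hoodie.cleaner.commits.retained": 1,
    # Archival: once the active timeline exceeds max, it is trimmed back to min
    # commits in .hoodie; min should stay above the cleaner's retained count.
    "hoodie.keep.min.commits": 2,
    "hoodie.keep.max.commits": 4,
}

assert (timeline_options["hoodie.keep.min.commits"]
        > timeline_options["hoodie.cleaner.commits.retained"])
```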





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662170951


   With 0.5.[1/2], Hudi stopped using renames for state transitions. Hence, you are seeing separate state files for each action. All of these files (except rollback) will be cleaned up as part of archiving.
   
   For rollback, here is the tracking ticket: https://issues.apache.org/jira/browse/HUDI-1118.
   
   Would you mind listing .hoodie completely and providing it as a link? Note that if there are any pending compactions, archiving will not archive any commits before the earliest pending compaction. Also note that the min/max values apply per action type (commit, clean, ...).





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658221173


   @asheeshgarg : I think what you are looking for is clustering (not compaction) of files, which is under development (please see https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance). For your setup, a good strategy would be to have a single Hudi writer read one or more of these datasets and ingest them into Hudi. Hudi supports file sizing; please see https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles for more details.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-661874341


   @bvaradar so the inserts are looking fine now; the COW compaction is generating 2 parquet files for each date.
   I also set the following properties:
   
   "hoodie.keep.min.commits": 2,
   "hoodie.keep.max.commits": 4,
   But I still see a lot of entries accumulating under .hoodie, like:
   2020-07-21 12:20:44    0 Bytes 20200721122042.rollback.inflight
   2020-07-21 12:21:45    1.2 KiB 20200721122143.rollback
   2020-07-21 12:21:45    0 Bytes 20200721122143.rollback.inflight
   2020-07-21 12:31:05    1.0 KiB 20200721123102.rollback
   2020-07-21 12:31:05    0 Bytes 20200721123102.rollback.inflight
   .......
   2020-07-21 13:43:15  950 Bytes 20200721134301.clean.inflight
   2020-07-21 13:43:15  950 Bytes 20200721134301.clean.requested
   2020-07-21 13:43:12    3.9 KiB 20200721134301.commit
   2020-07-21 13:43:02    0 Bytes 20200721134301.commit.requested
   2020-07-21 13:43:06    1.0 KiB 20200721134301.inflight
   
   Do I need to set anything else to clean these up?
   
   





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658188686


   @bvaradar Balaji, please let me know if I need to set additional properties to achieve this behavior.





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-660174273


   @bvaradar I think there was somehow a cleanup issue; after cleaning up all the files and setting
   "hoodie.cleaner.commits.retained": 1, I consistently see two parquet files, so this setting works.
   Thanks for all your responses. Two questions now:
   1) Is there a way to get rid of the one previous commit as well? Setting "hoodie.cleaner.commits.retained": 0 seems to disable the cleaner.
   2) Is there a way to run compaction for the .hoodie folder, as there is a lot of commit info accumulating under it?
   
   





[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-659206804


   @asheeshgarg : To grow an existing file, Hudi creates a new file version rather than modifying the existing file in place (https://hudi.apache.org/docs/concepts.html). This ensures snapshot isolation for readers who are querying while writes are happening. Note that removing older file versions is the responsibility of the Hudi cleaner (https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo). If you do not want lots of previous versions of the file lying around, you can set the cleaner config as specified in that link. Hope this helps; let me know if it did not.
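The cleaner's effect on file versions can be pictured with a toy model (purely illustrative, not actual Hudi code): each write adds a new version of the same file_id, and the cleaner eventually drops all but the newest few.

```python
def clean_file_versions(versions, retained):
    # versions: commit instants of one file_id's base-file versions.
    # Keep only the newest `retained` versions, as the cleaner would over time.
    return sorted(versions)[-retained:]

# Commit instants seen earlier in this thread.
versions = ["20200715202927", "20200715204132", "20200716170252", "20200716171413"]
assert clean_file_versions(versions, 2) == ["20200716170252", "20200716171413"]
```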





[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

Posted by GitBox <gi...@apache.org>.
asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-662612691


   @bvaradar are you suggesting I look at the Spark logs during ingestion, or at other logs?

