Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/23 02:08:08 UTC

[GitHub] [hudi] Rap70r opened a new issue #4876: Atomic overwrite of multiple files

Rap70r opened a new issue #4876:
URL: https://github.com/apache/hudi/issues/4876


   Hello,
   
   We have a use case where we need to periodically refresh a table stored in Parquet format on S3. The data is too large to write as a single Parquet file each time, so we use a DataFrame repartition to generate the output. We need to be able to overwrite the entire dataset atomically. Using Spark alone, the overwrite first removes the existing files and then writes the new partitions, which is not atomic: readers can observe a partially written dataset while the new files are being generated.
   Is it possible to use Hudi to overwrite the entire dataset in Parquet format atomically?
   
   Thank you
   



[GitHub] [hudi] Rap70r commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
Rap70r commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1048897007


   What if a job ingests the files without using Hudi? Would that job get "dirty" data? Do we need to use Hudi for both the producer and the consumer jobs in order to get consistent data?
   Can you also clarify the difference between the insert_overwrite_table and insert_overwrite operations?
   
   Thank you



[GitHub] [hudi] nsivabalan commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1048845019


   Yes, Hudi has an insert_overwrite_table write operation through which you can overwrite an entire table.
   https://hudi.apache.org/docs/quick-start-guide#insert-overwrite
   
   Cleanup of the actual data files happens lazily, but the overwritten data will not be seen by queries issued after the insert_overwrite_table operation has committed successfully.
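   
   For illustration, a minimal sketch of such a write through the Spark datasource (the paths, record key, and field names here are placeholders, not values from this issue):
   
   ```
   import org.apache.spark.sql.SaveMode
   
   val df = spark.read.parquet("s3://my-bucket/input/")
   
   // Hedged sketch: replace the whole table in a single commit.
   // SaveMode.Append is used so Hudi itself performs the replacement via
   // the insert_overwrite_table operation instead of wiping the table path.
   df.write.format("hudi").
     option("hoodie.datasource.write.operation", "insert_overwrite_table").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.table.name", "my_table").
     mode(SaveMode.Append).
     save("s3://my-bucket/hudi/my_table/")
   ```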
   
   



[GitHub] [hudi] Rap70r edited a comment on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
Rap70r edited a comment on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1048859966


   Hello @nsivabalan,
   
   Thank you for getting back to me. This looks promising. I will try to create a simple Glue job and test this.
   One question: during cleanup, if a job is querying the files, will it get an incomplete data set due to the lazy file cleanup?
   Also, did you mean https://hudi.apache.org/docs/writing_data/#insert-overwrite-table?
   
   Thank you



[GitHub] [hudi] pmgod8922 commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
pmgod8922 commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1055231174


   Hello, may I ask a question:
   1. I create a SparkSession:
   ```
   SparkSession sparkSession = SparkSession.builder()
       .enableHiveSupport()
       .appName("HudiTest")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .getOrCreate();
   ```
   2. I configure the Hudi parameters:
   ```
   Map<String, String> optionMap = new HashMap<>();
   optionMap.put(DataSourceWriteOptions.RECORDKEY_FIELD().key(), "uuid");
   optionMap.put(DataSourceWriteOptions.PRECOMBINE_FIELD().key(), "dt");
   optionMap.put(DataSourceWriteOptions.PARTITIONPATH_FIELD().key(), "dt");
   optionMap.put("hoodie.insert.shuffle.parallelism", "10");
   optionMap.put("hoodie.upsert.shuffle.parallelism", "10");
   optionMap.put("hoodie.datasource.write.operation", "insert");
   optionMap.put("hoodie.parquet.small.file.limit", "94857600");
   optionMap.put("hoodie.parquet.max.file.size", "200829120");
   optionMap.put("hoodie.merge.allow.duplicate.on.inserts", "true");
   optionMap.put(HoodieWriteConfig.TBL_NAME.key(), "my_table_B");
   ```
   3. I write the output data to HDFS.
   4. **I found that the small files are not merged.**
   
   Can you help me check whether any configuration is missing?




[GitHub] [hudi] nsivabalan commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1048890818


   Nope. If you are using snapshot queries, you will not be affected by the cleaner. But if you are doing incremental queries, you may hit some issues.
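   
   For reference, a hedged sketch of the two query types through the Spark datasource (the table path and the begin instant are placeholders; the begin instant would come from your own commit timeline):
   
   ```
   // Snapshot query (the default): serves only the latest committed file slices.
   val snapshotDf = spark.read.format("hudi")
     .load("s3://my-bucket/hudi/my_table/")
   
   // Incremental query: serves records committed after the given instant time.
   val incrementalDf = spark.read.format("hudi")
     .option("hoodie.datasource.query.type", "incremental")
     .option("hoodie.datasource.read.begin.instanttime", "20220223000000")
     .load("s3://my-bucket/hudi/my_table/")
   ```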



[GitHub] [hudi] nsivabalan commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1057505142


   thanks!




[GitHub] [hudi] Rap70r commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
Rap70r commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1055758587


   Got it. Thank you @nsivabalan.



[GitHub] [hudi] nsivabalan commented on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1050314304


   Hey @Rap70r:
   Let me explain what happens behind the scenes with a COW (copy-on-write) table.
   
   Let's assume we have only one partition and that the entire dataset we plan to ingest fits into one data file.
   
   Commit 1:
   writes data_file1_v1.
   
   Commit 2: updates the same set of records as commit 1.
   Hudi writes data_file1_v2, a new parquet file that merges the new incoming data with what is in data_file1_v1.
   When you query Hudi at this point, only data from data_file1_v2 is served.
   
   Commit 3: again, updates to records from commit 1.
   Hudi writes data_file1_v3, following the same logic as commit 2.
   
   But Hudi has a cleaner that takes care of cleaning up older file versions.
   For example, hoodie.cleaner.commits.retained is the config to play with. If you set it to 3, then at C4 data_file1_v1 will be cleaned up, and at C5 data_file1_v2 will be cleaned up.
   
   Now, let's take a look at how insert overwrite works.
   Say you trigger insert_overwrite_table at C10.
   Hudi creates a new file group, data_file2_v1, containing just the new incoming records, and marks all previous file groups as invalid.
   So when Hudi is queried now, only data from data_file2_v1 is served and nothing else.
   
   At some later point in time, when the cleaner kicks in, it will clean up all the invalid file groups.
   
   So the actual cleanup is lazy, and that is why you see more files being added with every commit.
   
   insert_overwrite_table: the entire table contents are replaced with the current batch.
   insert_overwrite: overwrites only the matching partitions. Say your Hudi table has 1000 partitions and you ingest records into 100 of them with insert_overwrite: only those 100 partitions are overwritten with new data; the remaining 900 stay intact.
   
   Hope this clarifies things.
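   
   As a hedged illustration of the retention config named above (the value 3 is a placeholder; the other options mirror the code posted earlier in this thread), bounding how many old file versions survive between commits:
   
   ```
   import org.apache.spark.sql.SaveMode
   
   // Sketch: retain fewer commits so replaced file groups are cleaned sooner.
   df.write.format("hudi").
     option("hoodie.datasource.write.operation", "insert_overwrite_table").
     option("hoodie.datasource.write.precombine.field", "ts_ms").
     option("hoodie.datasource.write.recordkey.field", "some_key").
     option("hoodie.table.name", "test_table").
     // Keep only 3 commits' worth of old file versions around.
     option("hoodie.cleaner.commits.retained", "3").
     mode(SaveMode.Append).
     save("s3://some_bucket/output_folder/output/")
   ```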
   
   
   
   
   
   



[GitHub] [hudi] Rap70r edited a comment on issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
Rap70r edited a comment on issue #4876:
URL: https://github.com/apache/hudi/issues/4876#issuecomment-1050277266


   Hi @nsivabalan,
   
   I built a job that periodically runs the code below to refresh a table:
   
   ```
   import org.apache.spark.sql.SaveMode.Overwrite
   
   val df = spark.read.load("s3://some_bucket/some_folder/*.parquet")
   
   df.write.format("hudi").
     option("hoodie.datasource.write.operation", "insert_overwrite_table").
     option("hoodie.datasource.write.precombine.field", "ts_ms").
     option("hoodie.datasource.write.recordkey.field", "some_key").
     option("hoodie.table.name", "test_table").
     mode(Overwrite).
     save("s3://some_bucket/output_folder/output/")
   ```
   I noticed that each run keeps adding more parquet files to the output folder, and I am not sure why: the data in the source dataframe was not changing between runs.
   Am I doing something wrong here?
   
   The first run produced 265 files, the second run increased the count to 303, and the third run to 346.
   Can you please explain why new files are being generated? Also, is it possible to reduce the number of files produced?
   
   The goal is to refresh existing partitioned parquet files atomically.
   
   Thank you
   
   




[GitHub] [hudi] nsivabalan closed issue #4876: Atomic overwrite of multiple files

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4876:
URL: https://github.com/apache/hudi/issues/4876


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org