Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/27 10:03:37 UTC

[GitHub] [hudi] danny0405 commented on a change in pull request #3547: Add Hudi 0.9.0 release page with highlights

danny0405 commented on a change in pull request #3547:
URL: https://github.com/apache/hudi/pull/3547#discussion_r697316833



##########
File path: website/releases/release-0.9.0.md
##########
@@ -0,0 +1,118 @@
+---
+title: "Release 0.9.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2021-08-26T08:40:00-07:00
+---
+# [Release 0.9.0](https://github.com/apache/hudi/releases/tag/release-0.9.0) ([docs](/docs/quick-start-guide))
+
+## Download Information
+* Source Release : [Apache Hudi 0.9.0 Source Release](https://downloads.apache.org/hudi/0.9.0/hudi-0.9.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.9.0/hudi-0.9.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.9.0/hudi-0.9.0.src.tgz.sha512))
+* Apache Hudi jars corresponding to this release are available [here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+## Migration Guide for this release
+- If migrating from a release older than 0.5.3, please also check the upgrade instructions for each subsequent release below.
+- Specifically, check the upgrade instructions for 0.6.0.
+- With 0.9.0, Hudi adds more table properties to aid in using an existing Hudi table with spark-sql.
+  To smooth this transition, these properties are added to the `hoodie.properties` file. Whenever Hudi is launched with
+  the newer table version, i.e. 2 (or when moving from pre-0.9.0 to 0.9.0), an upgrade step is executed automatically.
+  This automatic upgrade step happens just once per Hudi table, as `hoodie.table.version` is updated in the
+  properties file once the upgrade completes.
+- Similarly, a command line tool for downgrading (command: `downgrade`) has been added in case some users want to
+  downgrade Hudi from table version 2 to 1, or move from Hudi 0.9.0 back to a pre-0.9.0 release.
+- This release also replaces the string representation of all configs with `ConfigProperty`. Older configs (strings) are deprecated
+  and users are encouraged to use the new `ConfigProperty` counterparts.
+
+## Release Highlights
+
+### Spark SQL DML and DDL Support
+
+0.9.0 brings the ability to perform data definition (create, modify and truncate) and data manipulation (inserts,
+upserts and deletes) on Hudi tables using the familiar Structured Query Language (SQL) via Spark SQL.
+This is a huge step towards making Hudi more easily accessible and operable by all personas (non-engineers, analysts, etc.).
+Users can now use `INSERT`, `UPDATE`, `MERGE INTO` and `DELETE`
+SQL statements to manipulate data. In addition, the `INSERT OVERWRITE` statement can be used to overwrite existing data in a table or partition
+with new values.
+
+On the data definition side, users can now use the `CREATE TABLE....USING HUDI` statement, which creates and registers the
+table as a Hudi Spark DataSource table, instead of the earlier approach of having to define the Hudi input format and Parquet
+SerDe. They can also use the `CREATE TABLE....AS`, `ALTER TABLE` and `TRUNCATE TABLE` statements to create, modify or truncate table
+definitions.
+
+Please see [RFC-25](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi)
+for more implementation details and follow this [page](/docs/quick-start-guide) to get started with DML and DDL!
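+
+A minimal sketch of what this looks like from `spark-shell` (the table name, columns and option values are illustrative; the session must be started with the Hudi Spark bundle and the `HoodieSparkSessionExtension` SQL extension as described in the quick start guide):
+
+```scala
+// spark-shell launched with the Hudi Spark bundle and
+// --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+
+// DDL: create a Hudi table that is registered as a Spark DataSource table
+spark.sql("""
+  CREATE TABLE hudi_trips (
+    uuid STRING,
+    rider STRING,
+    fare DOUBLE,
+    ts BIGINT
+  ) USING hudi
+  OPTIONS (primaryKey = 'uuid', preCombineField = 'ts')
+""")
+
+// DML: insert, update, merge and delete
+spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 19.10, 1000)")
+spark.sql("UPDATE hudi_trips SET fare = 25.0 WHERE uuid = 'id-1'")
+spark.sql("""
+  MERGE INTO hudi_trips AS t
+  USING (SELECT 'id-2' AS uuid, 'rider-B' AS rider, 27.70 AS fare, 2000 AS ts) AS s
+  ON t.uuid = s.uuid
+  WHEN MATCHED THEN UPDATE SET *
+  WHEN NOT MATCHED THEN INSERT *
+""")
+spark.sql("DELETE FROM hudi_trips WHERE uuid = 'id-1'")
+```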
+
+### Query side improvements
+- Hudi tables can now be registered and read as a DataSource table in Spark.
+- **Metadata based File Listing Improvements for Spark**: ?? 
+- Added support for time travel queries. Please check the [quick start page](/docs/next/quick-start-guide)
+  for how to invoke them.
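+
+A hedged sketch of a time travel read through the Spark DataSource (the `as.of.instant` option key, the metadata listing flag and the base path are assumptions here; please verify them against the quick start and configuration pages):
+
+```scala
+// Read the table as of a past instant (commit time) instead of the latest snapshot.
+val tripsAsOf = spark.read
+  .format("hudi")
+  .option("as.of.instant", "20210728141108")   // assumed time travel option: commit timestamp to read as of
+  .option("hoodie.metadata.enable", "true")    // use the metadata table for file listing
+  .load("s3://bucket/path/to/hudi_trips")      // illustrative base path
+
+tripsAsOf.createOrReplaceTempView("trips_as_of")
+spark.sql("SELECT uuid, fare FROM trips_as_of").show()
+```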
+
+### Writer side improvements 
+- Virtual keys support has been added, where users can avoid adding meta fields to a Hudi table and leverage existing
+  fields to populate record keys and partition paths. One needs to disable [this](/docs/next/configurations#hoodiepopulatemetafields)
+  config to enable virtual keys (see the write sketch after this list).
+- Clustering improvements:  
+    - Async clustering support has been added to both DeltaStreamer and Spark structured streaming. More details on this,
+      and how to use it, can be found in this blog post.
+    - Incremental read works for clustered data as well. ? This is more of a fix right?
+    - `HoodieClusteringJob` has been added to assist in building and executing a clustering plan as a standalone job.
+    - Added a config (`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`) to skip the latest N partitions
+      while creating the clustering plan (see the clustering config sketch after this list).
+- Bulk insert using the row writer path has been enhanced and brought to feature parity with the WriteClient's version of bulk insert.
+  Users can start to use the row writer path for better performance.
+- Added support for HMS in `HiveSyncTool`. `HMSDDLExecutor` is a `DDLExecutor` implementation based on HMS, which uses HMS
+  APIs directly for all DDL tasks.
+- A pre-commit validator framework has been added to the Spark engine. Users can leverage this framework to add validations such as
+  verifying that all valid files are present for a given instant, or that all invalid files are removed.
+- Users can choose to drop the fields used to generate partition paths, if need be (`hoodie.datasource.write.drop.partition.columns`).
+- Added support for "delete_partition" operation to spark. Users can leverage this to delete older partitions as and when required.    
+- Disk based map improvements? 
+- Concurrency control: https://issues.apache.org/jira/browse/HUDI-944
+- ORC format support: ?
+- Support for Huawei Cloud Object Storage, BAIDU AFS storage format, Baidu BOS storage in Hudi: Should we expand more on this? 
+- Marker file management has been enhanced to use batch mode and is now centrally coordinated by the timeline server. This benefits
+  cloud stores like S3 when a large number of marker files are created or deleted during commits. You can read more in this blog.
+- The Java engine now supports `HoodieBloomIndex`.
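+
+A hedged DataFrame-write sketch tying some of the writer-side options above together (dataset, base path and field names are illustrative; only the config keys named in this list, plus the standard record key / partition path / precombine options, are assumed):
+
+```scala
+import org.apache.spark.sql.SaveMode
+import spark.implicits._
+
+// Illustrative input data; in practice this would come from your source.
+val df = Seq(
+  ("id-1", "rider-A", 19.10, 1000L, "2021-08-20"),
+  ("id-2", "rider-B", 27.70, 2000L, "2021-08-21")
+).toDF("uuid", "rider", "fare", "ts", "partition_date")
+
+df.write
+  .format("hudi")
+  .option("hoodie.table.name", "hudi_trips")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partition_date")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  // Virtual keys: do not materialize the _hoodie_* meta fields; keys and partition
+  // paths are derived from the existing columns instead.
+  .option("hoodie.populate.meta.fields", "false")
+  // Drop the columns that are only used to build the partition path from the data files.
+  .option("hoodie.datasource.write.drop.partition.columns", "true")
+  .mode(SaveMode.Append)
+  .save("s3://bucket/path/to/hudi_trips")
+```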
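+
+For clustering, a hedged config sketch using inline clustering for brevity (async clustering via DeltaStreamer or `HoodieClusteringJob` reuses the same plan strategy keys; all values are illustrative):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// Clustering options layered on top of a regular Hudi DataFrame write (df as in the previous sketch).
+val clusteringOpts = Map(
+  "hoodie.clustering.inline" -> "true",             // trigger clustering inline after commits
+  "hoodie.clustering.inline.max.commits" -> "4",    // cluster roughly every 4 commits (illustrative)
+  // Skip the latest N partitions when building the clustering plan (config named above).
+  "hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions" -> "1"
+)
+
+df.write
+  .format("hudi")
+  .options(clusteringOpts)
+  // ... plus the usual table name / key / partition path options ...
+  .mode(SaveMode.Append)
+  .save("s3://bucket/path/to/hudi_trips")
+```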
+
+### Flink Integration Improvements
+- **Insert Overwrite Support**: HUDI-1788
+- **Bulk Insert Support**: HUDI-2209
+- **Non Partitioned Table Support**: HUDI-1814

Review comment:
    - The Flink writer supports the `CDC format` for `MOR` tables: turn on the option `changelog.enabled` and Hoodie will persist all change flags of each record; using the streaming reader of Flink, users can do stateful computation based on these change logs. Note that when a commit is compacted by the async compaction service, all the intermediate changes are merged into one (the last record), leaving only `UPSERT` semantics.
    - Bulk insert is supported for efficient loading of an existing table; set `write.operation` to `bulk_insert` to use it.
    - You can now do streaming reads of COW tables.
    - Delete messages are emitted by default in streaming read mode; when `changelog.enabled` is `false`, the downstream receives a `DELETE` message as a Hoodie record with an empty payload.
    - The Flink writer can now update historical partitions, i.e. delete the old record in the historical partition and then insert the new record in the current partition; turn on `index.global.enabled` to use this.
    - Hive sync has been greatly improved by supporting different Hive versions (1.x, 2.x, 3.x).
    - Flink supports a pure log append mode: no records are deduplicated for either `COW` or `MOR` tables, and parquet files are written directly on each flush; turn off `write.insert.deduplicate` to use this mode.
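
A hedged Flink sketch of these options through the Scala Table API (table name, schema, path and option values are illustrative; only the option keys mentioned above plus the standard `connector`, `path` and `table.type` keys are used, and the streaming-read flag `read.streaming.enabled` is an assumption as well):

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// MOR table that keeps per-record change flags (changelog.enabled) so Flink's streaming
// reader can consume the change log; index.global.enabled lets an update move a record
// from its old (historical) partition to the current one.
tableEnv.executeSql("""
  CREATE TABLE hudi_trips (
    uuid STRING PRIMARY KEY NOT ENFORCED,
    rider STRING,
    fare DOUBLE,
    ts TIMESTAMP(3),
    `partition` STRING
  ) PARTITIONED BY (`partition`) WITH (
    'connector' = 'hudi',
    'path' = 's3://bucket/path/to/hudi_trips',
    'table.type' = 'MERGE_ON_READ',
    'changelog.enabled' = 'true',
    'read.streaming.enabled' = 'true',
    'index.global.enabled' = 'true'
  )
""")

// For a one-off efficient load of existing data, set 'write.operation' = 'bulk_insert'
// on the sink table; for pure log append mode, also set 'write.insert.deduplicate' = 'false'.
```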




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org