Posted to commits@hudi.apache.org by "SteNicholas (via GitHub)" <gi...@apache.org> on 2023/02/17 01:44:56 UTC

[GitHub] [hudi] SteNicholas commented on a diff in pull request #7985: [DOCS] Update clustering docs

SteNicholas commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1109190014


##########
website/docs/clustering.md:
##########
@@ -10,6 +10,17 @@ last_modified_at:
 Apache Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing. In a data lake/warehouse, one of the key trade-offs is between ingestion speed and query performance. Data ingestion typically prefers small files to improve parallelism and make data available to queries as soon as possible. However, query performance degrades significantly with a large number of small files. Also, during ingestion, data is typically co-located based on arrival time. However, query engines perform better when frequently queried data is co-located together. In most architectures, each of these systems tends to add optimizations independently to improve performance, which hits limitations due to un-optimized data layouts. This doc introduces a new kind of table service called clustering [[RFC-19]](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) to reorganize data for improved query performance without compromising on ingestion speed.
 <!--truncate-->
 
+## How is compaction different from clustering?
+
+Hudi is modeled as a log-structured storage engine with multiple versions of the data.
+In particular, [Merge-On-Read](/docs/table_types#merge-on-read-table)
+tables in Hudi store data using a combination of base files in columnar format and row-based delta logs that contain
+updates. Compaction is a way to merge the delta logs with base files to produce the latest file slices with the most
+recent snapshot of data. Compaction helps to keep query performance in check (larger delta log files would incur
+longer merge times on the query side). Clustering, on the other hand, is a data layout optimization technique. One can stitch
+together small files into larger files using clustering. Additionally, data can be clustered by a sort key so that queries
+can take advantage of data locality.
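
As a rough illustration of the clustering behavior described above, inline clustering can be enabled through Hudi's standard `hoodie.clustering.*` write configs. The option keys below are the documented ones; the specific values (commit cadence, file-size thresholds, sort columns) are placeholder assumptions, not recommendations from this PR:

```properties
# Run clustering inline after every N commits (N=4 here is illustrative)
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
# Stitch files smaller than ~300 MB into larger target files of up to ~1 GB
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
# Sort by hypothetical columns so queries benefit from data locality
hoodie.clustering.plan.strategy.sort.columns=city,event_ts
```

Sorting by columns that appear frequently in query predicates is what enables the task reduction and data-skipping benefits on the query side.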

Review Comment:
  Does this need to explain that clustering helps to improve query performance via task reduction and sort elimination on the query side?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org