Posted to commits@hudi.apache.org by "bhasudha (via GitHub)" <gi...@apache.org> on 2023/02/21 07:40:09 UTC

[GitHub] [hudi] bhasudha commented on a diff in pull request #7985: [DOCS] Update clustering docs

bhasudha commented on code in PR #7985:
URL: https://github.com/apache/hudi/pull/7985#discussion_r1112664856


##########
website/docs/clustering.md:
##########
@@ -51,8 +62,147 @@ NOTE: Clustering can only be scheduled for tables / partitions not receiving any
 ![Clustering example](/assets/images/blog/clustering/example_perf_improvement.png)
 _Figure: Illustrating query performance improvements by clustering_
 
-### Setting up clustering
-Inline clustering can be setup easily using spark dataframe options. See sample below
+## Clustering Usecases
+
+### Batching small files
+
+As mentioned in the intro, streaming ingestion generally results in smaller files in your data lake. But having a lot of
+such small files could bring down your query latency. From our experience supporting community users, there are quite a

Review Comment:
   "bring down query latency" is confusing. I think what you meant was "affect query latencies"? Or we could even go with "higher query latencies"?


