Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/29 01:14:53 UTC

[GitHub] [hudi] nsivabalan commented on a diff in pull request #5440: [HUDI-3930][Docs] Adding documentation around Data Skipping

nsivabalan commented on code in PR #5440:
URL: https://github.com/apache/hudi/pull/5440#discussion_r861409033


##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 
 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
 
-### Snapshot Queries
 
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
 
-**Hive**
+#### Data Skipping
+ 
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages files metadata to very effectively prune the search space, by 
+avoiding reading (even footers of) the files that are known (based on the metadata) to only contain the data that _does not match_ the query's filters.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png"  />
-</figure>
+Data Skipping is leveraging Metadata Table's Column Stats Index bearing column-level statistics (such as min-value, max-value, count of null-values in the column, etc)

Review Comment:
   I believe this is available only for COW (copy-on-write) tables, right? If so, should we call that out here as well?
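
For readers following the thread, the pruning idea described in the diff above can be illustrated with a minimal sketch. This is not Hudi's implementation or API: `ColumnStats`, `may_contain`, and `prune_files` are hypothetical names, and the per-file statistics are made-up values. It only shows how min/max column statistics let a reader rule out files for a range filter without opening them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnStats:
    """Hypothetical per-file statistics for a single column (illustration only)."""
    file_name: str
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats: ColumnStats, lower: int, upper: int) -> bool:
    # A file can hold matching rows only if its [min, max] range
    # overlaps the filter range [lower, upper].
    return stats.max_value >= lower and stats.min_value <= upper


def prune_files(all_stats: List[ColumnStats], lower: int, upper: int) -> List[str]:
    # Keep only files whose statistics cannot rule out a match;
    # every other file is skipped without being read.
    return [s.file_name for s in all_stats if may_contain(s, lower, upper)]


# Made-up statistics for three files.
stats = [
    ColumnStats("f1.parquet", min_value=0, max_value=100, null_count=0),
    ColumnStats("f2.parquet", min_value=101, max_value=200, null_count=3),
    ColumnStats("f3.parquet", min_value=201, max_value=300, null_count=0),
]

# Filter: 150 <= col <= 250. Only f2 and f3 can contain matching rows.
candidates = prune_files(stats, 150, 250)
print(candidates)
```

The check is conservative by design: a file is kept whenever its range overlaps the filter, so skipping never drops matching rows, it only avoids reading files that provably contain none.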


