Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/02 18:00:29 UTC
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5440: [HUDI-3930][Docs] Adding documentation around Data Skipping

alexeykudinkin commented on code in PR #5440:
URL: https://github.com/apache/hudi/pull/5440#discussion_r863061094


##########
website/docs/performance.md:
##########
@@ -60,25 +62,48 @@ For e.g , with 100M timestamp prefixed keys (5% updates, 95% inserts) on a event
 **~7X (2880 secs vs 440 secs) speed up** over vanilla spark join. Even for a challenging workload like an '100% update' database ingestion workload spanning 
 3.25B UUID keys/30 partitions/6180 files using 300 cores, Hudi indexing offers a **80-100% speedup**.
 
-### Snapshot Queries
 
-The major design goal for snapshot queries is to achieve the latency reduction & efficiency gains in previous section,
-with no impact on queries. Following charts compare the Hudi vs non-Hudi tables across Hive/Presto/Spark queries and demonstrate this.
+### Read Path
 
-**Hive**
+#### Data Skipping
+ 
+Data Skipping is a technique (originally introduced in Hudi 0.10) that leverages file metadata to prune the search space very effectively, by
+avoiding reading the files (or even their footers) that are known, based on that metadata, to contain only data that _does not match_ the query's filters.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_hive.png").default} alt="hudi_query_perf_hive.png"  />
-</figure>
+Data Skipping leverages the Metadata Table's Column Stats Index, which holds column-level statistics (such as min value, max value, count of null values in the column, etc.)
+for every file of the Hudi table. For every incoming query, instead of enumerating every file in the table and reading its metadata
+(for example, Parquet footers) to determine whether it could contain any data matching the query's filters, Hudi simply queries the Column Stats Index
+in the Metadata Table (which is itself a Hudi table) and within seconds (even for TB-scale tables with tens of thousands of files) obtains the list
+of _all the files that might potentially contain the data_ matching the query's filters. The crucial property is that files that can be ruled out as not containing such data
+(based on their column-level statistics) are stripped out.
 
-**Spark**
+In spirit, Data Skipping is very similar to Partition Pruning for tables using Physical Partitioning, where records in the dataset are partitioned on disk
+into a folder structure based on some column's value or its derivative (clumping records together based on some intrinsic measure). Instead
+of an on-disk folder structure, however, Data Skipping leverages an index maintaining a mapping "file &rarr; columns' statistics" for all of the columns persisted
+within that file.
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_spark.png").default} alt="hudi_query_perf_spark.png"  />
-</figure>
+For very large tables (1 TB+, tens of thousands of files), Data Skipping could
+1. Substantially improve query execution runtime (by avoiding fruitless compute churn), in excess of **10x** compared to the same query on the same dataset without Data Skipping enabled.
+2. Help avoid hitting cloud storage throttling limits (for issuing too many requests; for example, AWS limits the number of requests per second that can be issued based on the object's prefix, which considerably complicates things for partitioned tables)
 
-**Presto**
+If you're interested in learning more about how Data Skipping works internally, please watch out for a blog post coming out on this soon!
 
-<figure>
-    <img className="docimage" src={require("/assets/images/hudi_query_perf_presto.png").default} alt="hudi_query_perf_presto.png"  />
-</figure>
+To unlock the power of Data Skipping you will need to
+
+1. Enable Metadata Table along with Column Stats Index on the _write path_ (TODO(alexey) add ref to async indexer)
+2. Enable Data Skipping in your queries
+
+To enable the Metadata Table along with the Column Stats Index on the write path, make sure
+the following properties are set to true:
+  - `hoodie.metadata.enable` (to enable Metadata Table on the write path, enabled by default)
+  - `hoodie.metadata.index.column.stats.enable` (to enable Column Stats Index being populated on the write path, disabled by default)
+
+TODO(alexey) add ref to async indexer docs
+> NOTE: If you're planning on enabling the Column Stats Index for an already existing table, please check out the Async Indexer documentation
+> on how to build Metadata Table indexes (such as the Column Stats Index) for existing tables
+
+
+To enable Data Skipping in your queries, make sure to set the following properties to "true" (on the read path):

Review Comment:
   Can you please elaborate on what you have in mind regarding the snippet showing how to use it? Data Skipping (DS) is mostly an optimization that prunes the search space under the hood when queries are executed. The only way users interface with it is by specifying the configuration.
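
   To illustrate what "pruning the search space under the hood" means, here is a minimal Python sketch of the idea. It is not Hudi's implementation: `ColumnStats`, `ColumnStatsIndex`, and `candidate_files` are hypothetical names, the file names are made up, and only integer min/max statistics are modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file, per-column statistics (min/max only)."""
    min_value: int
    max_value: int


# Hypothetical stand-in for a column stats index: file name -> column -> stats.
ColumnStatsIndex = Dict[str, Dict[str, ColumnStats]]


def candidate_files(index: ColumnStatsIndex, column: str, lo: int, hi: int) -> List[str]:
    """Return files whose [min, max] range for `column` overlaps [lo, hi].

    Files with no stats for the column are kept, since they cannot be ruled out.
    """
    kept = []
    for file_name, stats_by_column in sorted(index.items()):
        stats = stats_by_column.get(column)
        if stats is None or (stats.min_value <= hi and stats.max_value >= lo):
            kept.append(file_name)
    return kept


index = {
    "file-1.parquet": {"ts": ColumnStats(0, 99)},
    "file-2.parquet": {"ts": ColumnStats(100, 199)},
    "file-3.parquet": {"ts": ColumnStats(200, 299)},
    "file-4.parquet": {},  # no stats recorded for "ts"
}

# Filter: ts BETWEEN 150 AND 220
print(candidate_files(index, "ts", 150, 220))
```

   The query author never calls anything like `candidate_files` directly; in this sketch it stands in for work the engine does on their behalf once the relevant configuration is set.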


