Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/09 12:52:55 UTC

[GitHub] [hudi] rkkalluri commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

rkkalluri commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179538626

   If you check the SQL tab of the Spark UI, or the explain plan for both statements, you can see that partition pruning happens in both cases and we only read files from one partition to satisfy your query. What differs is the planning stage: without the metadata table, the Spark catalog first has to list all partitions that exist so that it can filter (prune) them. That listing is exactly what the Hudi metadata table avoids, since it does not have to enumerate all partitions from storage
   
   [Screenshot: Screen Shot 2022-07-09 at 7 48 35 AM] https://user-images.githubusercontent.com/3401900/178106590-953c045c-9490-480d-ab4e-033789085672.png
   [Screenshot: Screen Shot 2022-07-09 at 7 47 45 AM] https://user-images.githubusercontent.com/3401900/178106592-fe0243f3-02c1-47f0-b88e-10bf1da270f9.png
   Hence the increased performance for you.
   
   >>> spark.read.format("hudi").option("hoodie.metadata.enable","true").load(basePath).filter("part=1").explain()
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet [_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,id#139L,combine#140L,part#141L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#141L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
   
   scala> spark.read.format("hudi").option("hoodie.metadata.enable","false").load(basePath).filter("part=1").explain(false)
   == Physical Plan ==                                                             
   *(1) ColumnarToRow
   +- FileScan parquet [_hoodie_commit_time#65,_hoodie_commit_seqno#66,_hoodie_record_key#67,_hoodie_partition_path#68,_hoodie_file_name#69,id#70L,combine#71L,part#72L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#72L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
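   For reference, the single switch being compared in the two reads above is the metadata-table config. A minimal sketch of that setting as a properties fragment (the per-read `.option(...)` form in the snippets above is the authoritative way to set it; whether your deployment also picks it up from a session-wide or writer-side config depends on your Hudi version, so treat that as an assumption):

   ```properties
   # Serve partition and file listings from Hudi's metadata table
   # instead of listing the storage path during query planning.
   hoodie.metadata.enable=true
   ```

   With this set to false, planning falls back to a full listing of the table path to discover partitions, which is the extra cost you observed even though the FileScan itself prunes to one partition either way.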


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org