
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/08 17:59:49 UTC

[GitHub] [hudi] parisni opened a new issue, #6068: [SUPPORT] Partition prunning broken with metadata disable

parisni opened a new issue, #6068:
URL: https://github.com/apache/hudi/issues/6068

   hudi 0.11.1
   spark 3.2.1
   -----------
   
   I see a huge performance drop when the metadata table is disabled at read time.
   Here is a reproducible example with 2k partitions (first spotted in production with 40k partitions).
   
   ```
   basePath = "/tmp/test_table"
   df = spark.range(1,2000).selectExpr("id", "id as part", "id as combine")
   hudi_options = {
       "hoodie.table.name": "test_table",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.partitionpath.field": "part",
       "hoodie.datasource.write.table.name": "test_table",
       "hoodie.datasource.write.operation": "bulk_insert",
       "hoodie.datasource.write.precombine.field": "combine",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.hive_style_partitioning": "true",
       "hoodie.datasource.hive_sync.enable": "false",
       "hoodie.metadata.enable": "true",
   }
   (df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath))
   ```
   
   Then try both (restarting Spark between the two tests):
   ```
   spark.read.format("hudi").option("hoodie.metadata.enable","true").load(basePath).filter("part=1").show()
   spark.read.format("hudi").option("hoodie.metadata.enable","false").load(basePath).filter("part=1").show()
   ```
   
   The former is fast, while the latter is as slow as reading the whole table (no partition pruning).
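
To put numbers on the comparison above, a small timing helper can wrap each read. This is a minimal sketch, not part of the original report: `timed` and `read_partition` are hypothetical helper names, and the sketch assumes an active `spark` session with the Hudi bundle on the classpath and the `basePath` from the repro. It uses `collect()` instead of `show()` so the rows are returned to the caller.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def read_partition(spark, base_path, metadata_enabled):
    """Read the rows with part=1, toggling hoodie.metadata.enable at read time.

    `spark` is assumed to be an existing SparkSession; this function is
    only defined here, not called, so the block runs without Spark.
    """
    flag = "true" if metadata_enabled else "false"
    return (
        spark.read.format("hudi")
        .option("hoodie.metadata.enable", flag)
        .load(base_path)
        .filter("part=1")
        .collect()
    )
```

In a PySpark shell this could be used as `rows, secs = timed(lambda: read_partition(spark, basePath, False))`, run once per setting with a fresh Spark session each time, as in the repro.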


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] rkkalluri commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

rkkalluri commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179539062

   "I have a huge performance drop when disabling metadata table at read time."
   
   The above should read more like 
   
   "I have a huge performance improvement when enabling metadata table at read time."



[GitHub] [hudi] rkkalluri commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

rkkalluri commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179769828

   I am able to force reading of just one partition with plain Spark, like this:
   
   
   scala> spark.read.format("parquet").option("basePath",basePath).load("/tmp/test_table/1").show()
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|part|combine|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-------+
   |  20220708143553165|20220708143553165...|                 1|                     1|abb854c5-dbdc-4ec...|  1|   1|      1|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-------+
   
   The equivalent read with Hudi does not work, though: it reads data from all partitions and runs the catalog file metadata collection, as shown below.
   
   scala> spark.read.format("hudi").option("hoodie.metadata.enable","false").option("basePath",basePath).load("/tmp/test_table/1").show()
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+----+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|combine|part|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+-------+----+
   |  20220708143553165|20220708143553165...|               196|                   196|abb854c5-dbdc-4ec...|196|    196| 196|
   |  20220708143553165|20220708143553165...|               199|                   199|abb854c5-dbdc-4ec...|199|    199| 199|
   |  20220708143553165|20220708143553165...|               190|                   190|abb854c5-dbdc-4ec...|190|    190| 190|
   |  20220708143553165|20220708143553165...|               193|                   193|abb854c5-dbdc-4ec...|193|    193| 193|
   |  20220708143553165|20220708143553165...|               194|                   194|abb854c5-dbdc-4ec...|194|    194| 194|
   |  20220708143553165|20220708143553165...|               197|                   197|abb854c5-dbdc-4ec...|197|    197| 197|
   |  20220708143553165|20220708143553165...|               195|                   195|abb854c5-dbdc-4ec...|195|    195| 195|
   |  20220708143553165|20220708143553165...|               198|                   198|abb854c5-dbdc-4ec...|198|    198| 198|
   |  20220708143553165|20220708143553165...|               192|                   192|abb854c5-dbdc-4ec...|192|    192| 192|
   



[GitHub] [hudi] parisni commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

parisni commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179558219

   @rkkalluri this makes sense! Thanks for your explanation.



[GitHub] [hudi] rkkalluri commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

rkkalluri commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179538626

   If you check the SQL tab of the Spark UI or the explain plan for both statements, you can see that partition pruning is happening and that only files from one partition are read to satisfy your query. It is the planning stage that differs: without the metadata table, the Spark catalog first needs to know all the partitions that exist before it can filter or prune them. That is exactly the problem the Hudi metadata table solves: it does not have to list all partitions, hence the increased performance for you.
   
   <img width="478" alt="Screen Shot 2022-07-09 at 7 48 35 AM" src="https://user-images.githubusercontent.com/3401900/178106590-953c045c-9490-480d-ab4e-033789085672.png">
   <img width="478" alt="Screen Shot 2022-07-09 at 7 47 45 AM" src="https://user-images.githubusercontent.com/3401900/178106592-fe0243f3-02c1-47f0-b88e-10bf1da270f9.png">
   
   >>> spark.read.format("hudi").option("hoodie.metadata.enable","true").load(basePath).filter("part=1").explain()
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet [_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,id#139L,combine#140L,part#141L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#141L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
   
   scala> spark.read.format("hudi").option("hoodie.metadata.enable","false").load(basePath).filter("part=1").explain(false)
   == Physical Plan ==                                                             
   *(1) ColumnarToRow
   +- FileScan parquet [_hoodie_commit_time#65,_hoodie_commit_seqno#66,_hoodie_record_key#67,_hoodie_partition_path#68,_hoodie_file_name#69,id#70L,combine#71L,part#72L] Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1 paths)[/tmp/test_table], PartitionFilters: [(part#72L = 1)], PushedFilters: [], ReadSchema: struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
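
The planning-stage difference described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not Hudi's or Spark's actual code: it builds hive-style `part=N` directories in a temporary folder, then compares a planner that lists every partition before pruning with one that consults a prebuilt index (a plain dict standing in for the metadata table). All function names are hypothetical.

```python
import os
import tempfile


def make_table(root, n):
    """Create n hive-style partition directories, each with one empty data file."""
    for i in range(1, n + 1):
        part_dir = os.path.join(root, f"part={i}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "data.parquet"), "w") as f:
            f.write("")


def plan_by_listing(root, wanted):
    """No index: list every partition directory, then keep only the wanted one.

    Returns (files, listing_calls), counting one listing call per partition.
    """
    listing_calls = 0
    files = []
    for name in sorted(os.listdir(root)):
        part_dir = os.path.join(root, name)
        entries = os.listdir(part_dir)
        listing_calls += 1
        if name == wanted:
            files = [os.path.join(part_dir, e) for e in entries]
    return files, listing_calls


def build_index(root):
    """Build a partition -> files mapping once (toy stand-in for a metadata table)."""
    return {
        name: [os.path.join(root, name, e) for e in os.listdir(os.path.join(root, name))]
        for name in os.listdir(root)
    }


def plan_by_index(index, wanted):
    """With an index: a single lookup, no directory listing at planning time."""
    return index.get(wanted, []), 1


with tempfile.TemporaryDirectory() as root:
    make_table(root, 200)
    files_a, calls_a = plan_by_listing(root, "part=1")
    files_b, calls_b = plan_by_index(build_index(root), "part=1")
    print(f"listing: {calls_a} calls, index: {calls_b} lookup, same files: {files_a == files_b}")
```

Both planners return the same single file, which mirrors the identical `PartitionFilters` in the two plans above; only the work done before the scan differs. The cost of building and maintaining the index is deliberately left out of this toy model.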



[GitHub] [hudi] parisni closed issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

parisni closed issue #6068: [SUPPORT] Partition prunning broken with metadata disable
URL: https://github.com/apache/hudi/issues/6068



[GitHub] [hudi] rkkalluri commented on issue #6068: [SUPPORT] Partition prunning broken with metadata disable

Posted by GitBox <gi...@apache.org>.

rkkalluri commented on issue #6068:
URL: https://github.com/apache/hudi/issues/6068#issuecomment-1179541147

   spark.sql.sources.parallelPartitionDiscovery.threshold=32
   
   Maybe you can bump this config up to a bigger number, since you have 40K partitions, to see some improvement in the no-metadata situation.
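
One way to try the suggestion above is to pass the setting when the Spark application is launched. The sketch below only builds the option and the matching `--conf` arguments; `discovery_conf` and `as_submit_args` are hypothetical helper names, and the value 64 is an arbitrary illustration rather than a recommendation from this thread. Whether a larger threshold actually speeds up Hudi reads without the metadata table is not confirmed here and would need to be measured.

```python
# Key quoted in the comment above; 32 is the value shown there.
THRESHOLD_KEY = "spark.sql.sources.parallelPartitionDiscovery.threshold"


def discovery_conf(threshold):
    """Return a Spark conf dict that sets the partition discovery threshold."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return {THRESHOLD_KEY: str(threshold)}


def as_submit_args(conf):
    """Turn a conf dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Arbitrary example value, chosen only for illustration.
print(" ".join(as_submit_args(discovery_conf(64))))
```

The resulting arguments can be appended to a `spark-submit` or `pyspark` command line; the same key and value can also be supplied through `SparkSession.builder.config(key, value)` when the session is created.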

