You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "Ted-Jiang (via GitHub)" <gi...@apache.org> on 2023/05/10 06:09:56 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new issue, #6315: Avoid read parquet index when there is no filter pushdown

Ted-Jiang opened a new issue, #6315:
URL: https://github.com/apache/arrow-datafusion/issues/6315

                 I test in my local M1, seems there is no regression in q16 (which page index prune none data). Set  `DATAFUSION_EXECUTION_PARQUET_ENABLE_PAGE_INDEX=true  target/release/datafusion-cli`
   But i found 
   ```
   without page Index
     ParquetExec: file_groups={10 groups: [[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]}, projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000, elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0, bytes_scanned=2657280, row_groups_pruned=0, page_index_rows_filtered=0, predicate_ev
 aluation_errors=0, num_predicate_creation_errors=0, pushdown_rows_filtered=0, time_elapsed_scanning_until_data=25.893623ms, page_index_eval_time=20ns, time_elapsed_opening=117.212334ms, pushdown_eval_time=20ns, time_elapsed_scanning_total=250.615706ms, time_elapsed_processing=128.193753ms]           
      
   With page index
     ParquetExec: file_groups={10 groups: [[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]}, projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000, elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned=0, bytes_scanned=2903390, num_predicate_creation_errors=0, predica
 te_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0, time_elapsed_processing=133.898828ms, pushdown_eval_time=20ns, time_elapsed_scanning_until_data=17.617834ms, page_index_eval_time=20ns, time_elapsed_opening=88.243249ms, time_elapsed_scanning_total=238.237217ms]                                                                                                           |
   ```
   
   Form the `bytes_scanned` you can see second one scan more bytes (reasonable with page Index). 
   So IMOP , you do the test in  `google cloud machine` which  means will have more latency when fetch bytes than local machine.
   
   From the plan 
   ```
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | CoalescePartitionsExec, metrics=[output_rows=18314, elapsed_compute=25.292µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |   ProjectionExec: expr=[group_alias_0@0 as part.p_brand, group_alias_1@1 as part.p_type, group_alias_2@2 as part.p_size, COUNT(alias1)@3 as supplier_cnt], metrics=[output_rows=18314, elapsed_compute=29.417µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |     AggregateExec: mode=FinalPartitioned, gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1, group_alias_2@2 as group_alias_2], aggr=[COUNT(alias1)], metrics=[output_rows=18314, elapsed_compute=170.210247ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |       CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=85318, elapsed_compute=9.629845ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |         RepartitionExec: partitioning=Hash([Column { name: "group_alias_0", index: 0 }, Column { name: "group_alias_1", index: 1 }, Column { name: "group_alias_2", index: 2 }], 10), input_partitions=10, metrics=[fetch_time=4.378393626s, repart_time=56.449415ms, send_time=729.375µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |           AggregateExec: mode=Partial, gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1, group_alias_2@2 as group_alias_2], aggr=[COUNT(alias1)], metrics=[output_rows=85318, elapsed_compute=398.147584ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |             AggregateExec: mode=FinalPartitioned, gby=[group_alias_0@0 as group_alias_0, group_alias_1@1 as group_alias_1, group_alias_2@2 as group_alias_2, alias1@3 as alias1], aggr=[], metrics=[output_rows=118250, elapsed_compute=370.184626ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |               CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=118250, elapsed_compute=14.044362ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                 RepartitionExec: partitioning=Hash([Column { name: "group_alias_0", index: 0 }, Column { name: "group_alias_1", index: 1 }, Column { name: "group_alias_2", index: 2 }, Column { name: "alias1", index: 3 }], 10), input_partitions=10, metrics=[fetch_time=3.459734628s, repart_time=202.497167ms, send_time=106.304383ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                   AggregateExec: mode=Partial, gby=[p_brand@1 as group_alias_0, p_type@2 as group_alias_1, p_size@3 as group_alias_2, ps_suppkey@0 as alias1], aggr=[], metrics=[output_rows=118250, elapsed_compute=387.699496ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                     CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=118274, elapsed_compute=16.411µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                       HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(Column { name: "ps_suppkey", index: 0 }, Column { name: "s_suppkey", index: 0 })], metrics=[output_rows=118278, build_input_rows=118324, output_batches=13, input_rows=118278, input_batches=13, build_input_batches=20, build_mem_used=11183594, join_time=36.18104ms, build_time=71.978712ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=118324, elapsed_compute=16.599522ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                           RepartitionExec: partitioning=Hash([Column { name: "ps_suppkey", index: 0 }], 10), input_partitions=10, metrics=[fetch_time=2.856884712s, repart_time=44.825252ms, send_time=647.955µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                             ProjectionExec: expr=[ps_suppkey@1 as ps_suppkey, p_brand@3 as p_brand, p_type@4 as p_type, p_size@5 as p_size], metrics=[output_rows=118324, elapsed_compute=39.667µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                               CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=118324, elapsed_compute=47.91µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                                 HashJoinExec: mode=Partitioned, join_type=Inner, on=[(Column { name: "ps_partkey", index: 0 }, Column { name: "p_partkey", index: 0 })], metrics=[output_rows=29581, build_input_rows=800000, output_batches=10, input_rows=29581, input_batches=10, build_input_batches=100, build_mem_used=56076192, join_time=84.478875ms, build_time=281.910325ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=800000, elapsed_compute=7.032531ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                     RepartitionExec: partitioning=Hash([Column { name: "ps_partkey", index: 0 }], 10), input_partitions=10, metrics=[fetch_time=213.551508ms, repart_time=111.779256ms, send_time=2.88029ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                       ParquetExec: file_groups={10 groups: [[Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:0..3110285], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:3110285..6220570], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:6220570..9330855], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:9330855..12441140], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:12441140..15551425], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:15551425..18661710], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:18661710..21771995], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:21771995..24882280], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:24882280..27992565], [Users/yangjiang/tpch-parquet/partsupp/part-0.parquet:27992565..31102845]]}, projection=[ps_partkey, ps_suppkey], metrics=[output_rows=800000, elapsed_compute=10ns, spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned=0, bytes
 _scanned=2903390, num_predicate_creation_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0, time_elapsed_processing=133.898828ms, pushdown_eval_time=20ns, time_elapsed_scanning_until_data=17.617834ms, page_index_eval_time=20ns, time_elapsed_opening=88.243249ms, time_elapsed_scanning_total=238.237217ms]                                                                                                           |
   |                   |                                   CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=29581, elapsed_compute=4.735922ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                     RepartitionExec: partitioning=Hash([Column { name: "p_partkey", index: 0 }], 10), input_partitions=10, metrics=[fetch_time=1.742820835s, repart_time=362.862215ms, send_time=352.200595ms]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                       RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, metrics=[fetch_time=178.274085ms, repart_time=1ns, send_time=14.125µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                                         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=29581, elapsed_compute=2.310501ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                           FilterExec: p_brand@1 != Brand#45 AND p_type@2 NOT LIKE MEDIUM POLISHED% AND Use p_size@3 IN (SET) ([Literal { value: Int32(49) }, Literal { value: Int32(14) }, Literal { value: Int32(23) }, Literal { value: Int32(45) }, Literal { value: Int32(19) }, Literal { value: Int32(3) }, Literal { value: Int32(36) }, Literal { value: Int32(9) }]), metrics=[output_rows=29581, elapsed_compute=57.569834ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                             ParquetExec: file_groups={1 group: [[Users/yangjiang/tpch-parquet/part/part-0.parquet]]}, projection=[p_partkey, p_brand, p_type, p_size], predicate=p_brand@3 != Brand#45 AND p_type@4 NOT LIKE MEDIUM POLISHED% AND Use p_size@5 IN (SET) ([Literal { value: Int32(49) }, Literal { value: Int32(14) }, Literal { value: Int32(23) }, Literal { value: Int32(45) }, Literal { value: Int32(19) }, Literal { value: Int32(3) }, Literal { value: Int32(36) }, Literal { value: Int32(9) }]), pruning_predicate=(p_brand_min@0 != Brand#45 OR Brand#45 != p_brand_max@1) AND (p_size_min@2 <= 49 AND 49 <= p_size_max@3 OR p_size_min@2 <= 14 AND 14 <= p_size_max@3 OR p_size_min@2 <= 23 AND 23 <= p_size_max@3 OR p_size_min@2 <= 45 AND 45 <= p_size_max@3 OR p_size_min@2 <= 19 AND 19 <= p_size_max@3 OR p_size_min@2 <= 3 AND 3 <= p_size_max@3 OR p_size_min@2 <= 36 AND 36 <= p_size_max@3 OR p_size_min@2 <= 9 AND 9 <= p_size_max@3), metrics=[output_
 rows=200000, elapsed_compute=1ns, spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned=0, bytes_scanned=935235, num_predicate_creation_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0, time_elapsed_processing=114.579335ms, pushdown_eval_time=2ns, time_elapsed_scanning_until_data=20.304792ms, page_index_eval_time=184.001µs, time_elapsed_opening=3.196208ms, time_elapsed_scanning_total=179.528835ms] |
   |                   |                         CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=4, elapsed_compute=16.377µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                           RepartitionExec: partitioning=Hash([Column { name: "s_suppkey", index: 0 }], 10), input_partitions=10, metrics=[fetch_time=377.902418ms, repart_time=29.384µs, send_time=3.968µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
   |                   |                             RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1, metrics=[fetch_time=37.565626ms, repart_time=1ns, send_time=3.667µs]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                               ProjectionExec: expr=[s_suppkey@0 as s_suppkey], metrics=[output_rows=4, elapsed_compute=8.25µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                                 CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=4, elapsed_compute=43.582µs, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
   |                   |                                   FilterExec: s_comment@1 LIKE %Customer%Complaints%, metrics=[output_rows=4, elapsed_compute=21.399292ms, spill_count=0, spilled_bytes=0, mem_used=0]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                     ParquetExec: file_groups={1 group: [[Users/yangjiang/tpch-parquet/supplier/part-0.parquet]]}, projection=[s_suppkey, s_comment], predicate=s_comment@6 LIKE %Customer%Complaints%, metrics=[output_rows=10000, elapsed_compute=1ns, spill_count=0, spilled_bytes=0, mem_used=0, row_groups_pruned=0, bytes_scanned=221440, num_predicate_creation_errors=0, predicate_evaluation_errors=0, page_index_rows_filtered=0, pushdown_rows_filtered=0, time_elapsed_processing=12.006207ms, pushdown_eval_time=2ns, time_elapsed_scanning_until_data=12.450667ms, page_index_eval_time=251ns, time_elapsed_opening=2.942125ms, time_elapsed_scanning_total=34.518458ms]                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   |                   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
   +-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row in set. Query took 0.534 seconds.
   ```
   I found there is one place need improvement, which read `  ParquetExec: file_groups={10 groups: [[Users/yangjiang/tpch-parquet/partsup` **without filter pushdown still read the pageIndex bytes** 🤣 
   
   So i want to improve this and retest this in cloud env, @alamb Is this reasonable 🤔  PATL
   
   --------
   New Edit, i found even in cloud env, your test data file still on local disk 🤦 , but i think this still need improve
   
   _Originally posted by @Ted-Jiang in https://github.com/apache/arrow-datafusion/issues/5099#issuecomment-1539555148_
               


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #6315: Avoid read parquet index when there is no filter pushdown

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #6315: Avoid read parquet index when there is no filter pushdown
URL: https://github.com/apache/arrow-datafusion/issues/6315


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alippai commented on issue #6315: Avoid read parquet index when there is no filter pushdown

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #6315:
URL: https://github.com/apache/arrow-datafusion/issues/6315#issuecomment-1543250455

   Sorry, now I see that I didn't notice the different name of the timings, because of the changed order (time_elapsed_processing vs time_elapsed_scanning_total). Now it makes sense


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alippai commented on issue #6315: Avoid read parquet index when there is no filter pushdown

Posted by "alippai (via GitHub)" <gi...@apache.org>.

alippai commented on issue #6315:
URL: https://github.com/apache/arrow-datafusion/issues/6315#issuecomment-1542908897

   Not reading is always the best, but I'm surprised reading a 25kb page index takes 100ms. Is that expected?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on issue #6315: Avoid read parquet index when there is no filter pushdown

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.

Ted-Jiang commented on issue #6315:
URL: https://github.com/apache/arrow-datafusion/issues/6315#issuecomment-1543248742

   > Not reading is always the best, but I'm surprised reading a 25kb page index takes 100ms. Is that expected?
   
   Sorry, i could not see it cost 100ms, could you show me out?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org