Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/09 04:43:00 UTC

[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3769: Add benchmarks for testing row filtering

Ted-Jiang commented on PR #3769:
URL: https://github.com/apache/arrow-datafusion/pull/3769#issuecomment-1272454236

   @thinkharderdev thanks for the great benchmark.
   I ran parquet-tools locally on `logs.parquet` (1.0 GB) and got:
   ```
   (venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index  ./logs.parquet                                                     
   row group 0:
   column index for column service:
   Boudary order: UNORDERED
                         null count  min                                       max                                     
   page-0                         0  backend                                   frontend                                
   
   offset index for column service:
                             offset   compressed size       first row index
   page-0                        62               117                     0
   
   column index for column host:
   Boudary order: UNORDERED
                         null count  min                                       max                                     
   page-0                         0  i-1ec3ca3151468928.ec2.internal           i-1ec408f54dbd3750.ec2.internal         
   
   offset index for column host:
                             offset   compressed size       first row index
   page-0                       566               125                     0
   
   column index for column pod:
   Boudary order: UNORDERED
                         null count  min                                       max                                     
   page-0                         0  aejowuublavflbbsvlfozigwpmrxldvhaollk     zxxlzhdrucrhpicpdgxtfpyuknvviimggtq     
   
   offset index for column pod:
                             offset   compressed size       first row index
   page-0                      6689               602                     0
   
   column index for column container:
   Boudary order: UNORDERED
                         null count  min                                       max                                     
   page-0                         0  backend_container_0                       frontend_container_1                    
   
   offset index for column container:
                             offset   compressed size       first row index
   page-0                      7602               593                     0
   ```
   Each column has at most two pages. I think if we adjust the data to get more pages per column (for example, by reducing the page size), we will see a bigger performance gain with `enable_page_index` enabled, because there will be more opportunities to skip whole pages without decoding them! 🤔
   
   FYI, I see that Impala uses a fixed number of rows per page in its benchmarks to get good performance.
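
To illustrate why more (smaller) pages can help page-index pruning, here is a minimal sketch in plain Python. It does not use Parquet or DataFusion: `build_page_index`, `rows_skipped`, and `make_column` are hypothetical helpers, the integer column is synthetic, and the fixed rows-per-page layout is only an assumption modelled on the Impala remark. It shows that when a column is stored as a single page, the min/max range always covers the searched value and nothing can be skipped, while splitting the same data into smaller pages can only keep or increase the number of rows skipped.

```python
import random


def build_page_index(values, rows_per_page):
    """Split a column into fixed-row-count pages; keep (min, max, rows) per page."""
    pages = []
    for start in range(0, len(values), rows_per_page):
        chunk = values[start:start + rows_per_page]
        pages.append((min(chunk), max(chunk), len(chunk)))
    return pages


def rows_skipped(pages, target):
    """Count rows in pages whose [min, max] range cannot contain `target`."""
    return sum(rows for lo, hi, rows in pages if target < lo or target > hi)


def make_column(num_rows, run_length, seed=0):
    """Synthetic (hypothetical) data: values arrive in runs of equal values."""
    rng = random.Random(seed)
    column = []
    while len(column) < num_rows:
        column.extend([rng.randrange(1000)] * run_length)
    return column[:num_rows]


column = make_column(num_rows=100_000, run_length=500)
target = column[0]  # an equality predicate such as `col = target`

for rows_per_page in (100_000, 10_000, 1_000):
    pages = build_page_index(column, rows_per_page)
    skipped = rows_skipped(pages, target)
    print(f"{rows_per_page:>7} rows/page: {len(pages):>3} pages, {skipped} rows skipped")
```

How much is actually skipped depends on how the values are distributed: on randomly shuffled data almost every page's range would cover the target, and smaller pages would help far less.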

