Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/09 04:43:00 UTC
[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3769: Add benchmarks for testing row filtering
Ted-Jiang commented on PR #3769:
URL: https://github.com/apache/arrow-datafusion/pull/3769#issuecomment-1272454236
@thinkharderdev thanks for the great benchmark.
I ran parquet-tools locally on the generated file (1.0 GB) and got:
```
(venv) yangjiang@LM-SHC-15009782 data % parquet-tools column-index ./logs.parquet
row group 0:
column index for column service:
Boudary order: UNORDERED
null count min max
page-0 0 backend frontend
offset index for column service:
offset compressed size first row index
page-0 62 117 0
column index for column host:
Boudary order: UNORDERED
null count min max
page-0 0 i-1ec3ca3151468928.ec2.internal i-1ec408f54dbd3750.ec2.internal
offset index for column host:
offset compressed size first row index
page-0 566 125 0
column index for column pod:
Boudary order: UNORDERED
null count min max
page-0 0 aejowuublavflbbsvlfozigwpmrxldvhaollk zxxlzhdrucrhpicpdgxtfpyuknvviimggtq
offset index for column pod:
offset compressed size first row index
page-0 6689 602 0
column index for column container:
Boudary order: UNORDERED
null count min max
page-0 0 backend_container_0 frontend_container_1
offset index for column container:
offset compressed size first row index
page-0 7602 593 0
```
There are at most two pages per column. I think if we adjust the data to get more pages per column (for example by reducing the page size), we will see a bigger performance gain when `enable_page_index` is enabled, because there will be more opportunities to skip whole pages without decoding them! 🤔
FYI, I see that Impala chooses a fixed number of rows per page in its benchmarks to get good performance.
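To illustrate why more pages per column should help, here is a minimal std-only sketch of min/max page pruning. It is not DataFusion's or the parquet crate's implementation; `PageStats`, `build_pages`, and `skippable_rows` are hypothetical names, and the column is assumed to hold clustered (ascending) integer values so that page ranges do not overlap.

```rust
/// Min/max statistics for one data page, similar in spirit to a
/// Parquet column index entry (hypothetical type for illustration).
#[derive(Debug, Clone)]
struct PageStats {
    min: u32,
    max: u32,
    rows: usize,
}

/// Split `values` into pages of at most `rows_per_page` rows and
/// compute min/max per page. `rows_per_page` must be greater than 0.
fn build_pages(values: &[u32], rows_per_page: usize) -> Vec<PageStats> {
    values
        .chunks(rows_per_page)
        .map(|chunk| PageStats {
            // `chunks` never yields an empty slice, so unwrap is safe.
            min: *chunk.iter().min().unwrap(),
            max: *chunk.iter().max().unwrap(),
            rows: chunk.len(),
        })
        .collect()
}

/// Count the rows in pages that can be skipped for the predicate
/// `col == target`, using only the page-level min/max.
fn skippable_rows(pages: &[PageStats], target: u32) -> usize {
    pages
        .iter()
        .filter(|p| target < p.min || target > p.max)
        .map(|p| p.rows)
        .sum()
}

fn main() {
    // Assumed data: 1000 ascending values, so pages cover disjoint ranges.
    let values: Vec<u32> = (0..1000).collect();
    let target = 42;

    let one_page = build_pages(&values, 1000);
    let ten_pages = build_pages(&values, 100);

    println!("1 page:   skip {} rows", skippable_rows(&one_page, target));
    println!("10 pages: skip {} rows", skippable_rows(&ten_pages, target));
}
```

With a single page the min/max range covers the target, so nothing can be skipped; with ten pages only the page containing the target has to be decoded. The gain depends on how well the values are clustered: for columns whose pages all span a wide range, like the UNORDERED ones in the output above, smaller pages alone may prune much less.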