Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/13 18:49:27 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #3828: Enable Parquet Row Filtering by default (WIP)

alamb opened a new pull request, #3828:
URL: https://github.com/apache/arrow-datafusion/pull/3828

   Draft until
   - [ ] https://github.com/apache/arrow-datafusion/pull/3822 is merged
   - [ ] We have completed testing / validation

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/3463
   re https://github.com/apache/arrow-datafusion/issues/3462
   
   
   # Rationale for this change
   This PR turns on parquet scan predicate pushdown (see https://github.com/apache/arrow-datafusion/issues/3462) by default. I am putting it up early as part of the testing process, so we can work through any issues it may uncover.
   
   This feature promises to be one of the most significant performance improvements for DataFusion reading from parquet in a while. All the hard work was done by @Ted-Jiang, @thinkharderdev, and @tustvold.
   
   # What changes are included in this PR?
   Enable pushing filters into the scan directly
   
   Note this feature can be disabled by setting the `datafusion.execution.parquet.pushdown_filters` configuration setting to false. 
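
   As a rough illustration of what pushing a filter into the scan means, here is a toy sketch over in-memory slices. The function names `select_rows` and `take_rows` are hypothetical, and this is not DataFusion's or the parquet reader's actual implementation. The idea it shows: evaluate the predicate on the filter column alone, then materialize only the passing rows from the other columns.

   ```rust
   /// Step 1: evaluate the predicate on the filter column only and
   /// return the indices of the rows that pass.
   fn select_rows<F: Fn(i64) -> bool>(filter_column: &[i64], predicate: F) -> Vec<usize> {
       filter_column
           .iter()
           .enumerate()
           .filter(|(_, value)| predicate(**value))
           .map(|(row, _)| row)
           .collect()
   }

   /// Step 2: materialize only the selected rows of another column.
   /// Rows that failed the predicate are never copied (in a real reader,
   /// never decoded).
   fn take_rows<T: Clone>(column: &[T], rows: &[usize]) -> Vec<T> {
       rows.iter().map(|&row| column[row].clone()).collect()
   }
   ```

   For example, with a filter column `[5, 40, 12, 33, 7]` and the predicate `|q| q > 10`, `select_rows` yields the row indices 1, 2 and 3, and `take_rows` then copies only those three rows from any other column. Whether this is a win depends on how selective the predicate is and how expensive the other columns are to decode.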
   
   # Are there any user-facing changes?
   Hopefully faster performance
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #3828:
URL: https://github.com/apache/arrow-datafusion/pull/3828#issuecomment-1328062248

   > Specifically, I made the parquet files like this:
   > 
   > ```
   > RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- convert --input ~/tpch_data/data_SF1 --output ~/tpch_data/parquet_data_SF1 --format=parquet
   > ```
   > 
   > And then ran
   > 
   > ```
   > RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ~/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096          
   > 
   >     Finished release [optimized] target(s) in 0.28s
   >      Running `target/release/tpch benchmark datafusion --iterations 3 --path /home/alamb/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096`
   > Running benchmarks with the following options: DataFusionBenchmarkOpt { query: None, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/alamb/tpch_data/parquet_data_SF1", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
   > Query 1 iteration 0 took 1511.2 ms and returned 4 rows
   > Query 1 iteration 1 took 1372.2 ms and returned 4 rows
   > Query 1 iteration 2 took 1419.7 ms and returned 4 rows
   > Query 1 avg time: 1434.38 ms
   > thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:129:27
   > note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   > Error: ArrowError(ExternalError(ArrowError(ExternalError("Arrow error: External error: Execution error: Arrow error: External error: Arrow error: External error: Execution error: Arrow error: External error: Execution error: Join Error: task 218 panicked"))))
   > alamb@aal-dev:~/arrow-datafusion$ 
   > ```
   > 
   > FYI @Ted-Jiang -- haven't had a chance to file this as a ticket or look more carefully into it
   
   Thanks for testing this; I will try to figure it out tomorrow.




[GitHub] [arrow-datafusion] alamb closed pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)
URL: https://github.com/apache/arrow-datafusion/pull/3828




[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #3828:
URL: https://github.com/apache/arrow-datafusion/pull/3828#issuecomment-1328212562

   @alamb I think this is fixed by https://github.com/apache/arrow-datafusion/pull/4387.
   I ran
   ```
   (venv) yangjiang@LM-SHC-15009782 benchmarks % OPT_PARQUET_ENABLE_PAGE_INDEX=true  cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ~/tpch-parquet  --format parquet --batch-size 4096
   ```
   without error.




[GitHub] [arrow-datafusion] alamb commented on pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3828:
URL: https://github.com/apache/arrow-datafusion/pull/3828#issuecomment-1328038705

   Specifically, I made the parquet files like this:
   ```
   RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- convert --input ~/tpch_data/data_SF1 --output ~/tpch_data/parquet_data_SF1 --format=parquet
   ```
   
   And then ran
   
   ```
   RUSTFLAGS="-C target-cpu=native" cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ~/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096          
   
       Finished release [optimized] target(s) in 0.28s
        Running `target/release/tpch benchmark datafusion --iterations 3 --path /home/alamb/tpch_data/parquet_data_SF1 --format parquet --batch-size 4096`
   Running benchmarks with the following options: DataFusionBenchmarkOpt { query: None, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/alamb/tpch_data/parquet_data_SF1", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
   Query 1 iteration 0 took 1511.2 ms and returned 4 rows
   Query 1 iteration 1 took 1372.2 ms and returned 4 rows
   Query 1 iteration 2 took 1419.7 ms and returned 4 rows
   Query 1 avg time: 1434.38 ms
   thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:129:27
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   Error: ArrowError(ExternalError(ArrowError(ExternalError("Arrow error: External error: Execution error: Arrow error: External error: Arrow error: External error: Execution error: Arrow error: External error: Execution error: Join Error: task 218 panicked"))))
   alamb@aal-dev:~/arrow-datafusion$ 
   
   ```
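
   The panic in the log above is the generic failure mode of calling `Option::unwrap()` on `None`. The sketch below is an illustration only: it is not the code in `page_filter.rs`, and `PageStats` and `pages_to_read` are hypothetical names. It shows one defensive pattern, under the assumption that per-page statistics may be absent: treat a missing entry as "cannot prune, read the page" rather than unwrapping it.

   ```rust
   /// Hypothetical per-page min/max statistics for an integer column.
   #[derive(Debug, Clone, Copy)]
   struct PageStats {
       min: i64,
       max: i64,
   }

   /// For the predicate `lo <= value <= hi`, decide for each page whether it
   /// must be read. A page with missing statistics (`None`) is conservatively
   /// kept instead of being unwrapped, so absent metadata cannot cause a panic.
   fn pages_to_read(stats: &[Option<PageStats>], lo: i64, hi: i64) -> Vec<bool> {
       stats
           .iter()
           .map(|entry| match entry {
               // The page may contain matches only if its range overlaps [lo, hi].
               Some(s) => s.max >= lo && s.min <= hi,
               // No statistics: we cannot prove the page is irrelevant.
               None => true,
           })
           .collect()
   }
   ```

   The design choice here is that pruning is only an optimization, so falling back to reading the page keeps results correct when metadata is missing.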




[GitHub] [arrow-datafusion] alamb commented on pull request #3828: Enable Parquet Row and Page Filtering by default (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3828:
URL: https://github.com/apache/arrow-datafusion/pull/3828#issuecomment-1328038359

   A small update: when I ran the TPC-H benchmarks against the default parquet files created by the benchmark, I did not see any improvement. I also hit an error in the page index code, which I still need to track down.

