You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/13 16:56:29 UTC

[GitHub] [arrow-datafusion] b41sh opened a new pull request #719: Optimize min/max queries with table statistics

b41sh opened a new pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #640.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   Speeding up max and min queries that can utilize the availability of the table column statistics.
   
   On the TPC-H `lineitem` table we can see a 600x speedup:
   ```
   > CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION './benchmarks/parquet/lineitem/';
   > select max(l_orderkey) from X;
   +-----------------+
   | MAX(l_orderkey) |
   +-----------------+
   | 6000000         |
   +-----------------+
   1 row in set. Query took 0.004 seconds.
   > select min(l_quantity) from X;
   +-----------------+
   | MIN(l_quantity) |
   +-----------------+
   | 1               |
   +-----------------+
   1 row in set. Query took 0.003 seconds.
   ```
   Master:
   ```
   > CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION './benchmarks/parquet/lineitem/';
   > select max(l_orderkey) from X;
   +-----------------+
   | MAX(l_orderkey) |
   +-----------------+
   | 6000000         |
   +-----------------+
   1 row in set. Query took 2.589 seconds.
   > select min(l_quantity) from X;
   +-----------------+
   | MIN(l_quantity) |
   +-----------------+
   | 1               |
   +-----------------+
   1 row in set. Query took 2.146 seconds.
   ```
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-893232111


   arrow-rs 5.1 has been released, can we test it? @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-880777702


   > This looks really cool @b41sh -- thank you very much for the contribution. It is not all that often one gets a 600x speedup :)
   > 
   > The one thing I worry about / wonder about is "how do we ensure no one breaks this by accident as we refactor or change the code in the future"
   > 
   > Perhaps we could follow the model of https://github.com/apache/arrow-datafusion/blob/master/datafusion/tests/parquet_pruning.rs#L44 (or maybe just extend that test) by:
   > 
   > 1. Adding some statistics to the parquet scan about total row groups read  or rows read
   > 2. Run a query with min/max and validate that no actual row groups are read.
   > 
   > What do you think?
   
   hi, @alamb 
   Thanks for your review.
   I will add some tests for this case, I'm still working on it and will submit it later
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-887855915


   Looks like there are a few compile errors to sort out, but after that is fixed I think this should be ready to merge


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-893490077


   I've started the test run


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-888258100


   > we may have to wait until arrow-rs 5.1.0 is released
   
   FYI the planned to release date of arrow-rs 5.1.0 is this weekend / early next week


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-894489404


   Thanks @b41sh super 😎


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-888258100






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on a change in pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on a change in pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#discussion_r677421508



##########
File path: datafusion/src/physical_plan/expressions/mod.rs
##########
@@ -61,7 +61,7 @@ pub use is_not_null::{is_not_null, IsNotNullExpr};
 pub use is_null::{is_null, IsNullExpr};
 pub use lead_lag::{lag, lead};
 pub use literal::{lit, Literal};
-pub use min_max::{Max, Min};
+pub use min_max::{Max, MaxAccumulator, Min, MinAccumulator};

Review comment:
       I did not consider this situation, it has been modified.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#discussion_r669933745



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -312,22 +431,47 @@ impl ParquetExec {
             if let Some(x) = &part.statistics.column_statistics {
                 let part_nulls: Vec<Option<usize>> =
                     x.iter().map(|c| c.null_count).collect();
-                has_null_counts = true;
+                has_statistics = true;
+
+                let part_max_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.max_value.clone()).collect();
+                let part_min_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.min_value.clone()).collect();
 
                 for &i in projection.iter() {
                     null_counts[i] = part_nulls[i].unwrap_or(0);
+                    if let Some(part_max_value) = part_max_values[i].clone() {
+                        if let Some(max_value) = &mut max_values[i] {
+                            let _ = max_value.update(&[part_max_value]);

Review comment:
       This doesn't seem right to me -- it seems like it ignores errors if the min or max values can't be updated (maybe due to overflow or data type mismatch). I think it might be safer if there is an error calling `update()` to reset the `min_values[i]` or `max_values[i]` back to None
   
   Otherwise if you have something like
   ```
   Row Group 1: min=1
   Row Group 2: min=MIN_INT <-- and this causes errors for some reason
   Row Group 3: min=3
   ```
   The code as written will return `min =1` for this column in the parquet file which may be incorrect. I think a more conservative approach here would be for this code to return `None` so we don't erroneously use these statistics
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-894304725


   Thanks again @b41sh  -- this is a pretty awesome feature


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb merged pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-893655863


   I will plan to merge this PR once the CI tests pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-887950192


   > Looks like there are a few compile errors to sort out, but after that is fixed I think this should be ready to merge
   
   This is because the `has_min_max_set` of `arrow-rs 5.0.0` is a private function, we may have to wait until `arrow-rs 5.1.0` is released
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-887490790


   > I think this is looking great -- thank you @b41sh -- do you think this PR is ready for more review (it is still marked as a draft which is why I am asking)
   > 
   > cc @Dandandan this is a pretty cool one
   
   It's basically ok, please review the code and give me some advice, thanks. @alamb
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#issuecomment-893232111


   arrow-rs 5.1 has been released, can we test it? @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] b41sh commented on a change in pull request #719: Optimize min/max queries with table statistics

Posted by GitBox <gi...@apache.org>.

b41sh commented on a change in pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#discussion_r670558958



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -312,22 +431,47 @@ impl ParquetExec {
             if let Some(x) = &part.statistics.column_statistics {
                 let part_nulls: Vec<Option<usize>> =
                     x.iter().map(|c| c.null_count).collect();
-                has_null_counts = true;
+                has_statistics = true;
+
+                let part_max_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.max_value.clone()).collect();
+                let part_min_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.min_value.clone()).collect();
 
                 for &i in projection.iter() {
                     null_counts[i] = part_nulls[i].unwrap_or(0);
+                    if let Some(part_max_value) = part_max_values[i].clone() {
+                        if let Some(max_value) = &mut max_values[i] {
+                            let _ = max_value.update(&[part_max_value]);

Review comment:
       ok, I will modify it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org