You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/17 06:04:30 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new pull request, #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Ted-Jiang opened a new pull request, #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3833.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1319971745

   will add more test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1026585801


##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   All test case with expr fail 😭



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb merged PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1025451489


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,50 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        if let Ok(arr) = Decimal128Array::from_iter_values(
+                            vec.iter().map(|x| *x.$func().unwrap() as i128),
+                        )
+                        .with_precision_and_scale(*precision, *scale)
+                        {
+                            return Some(Arc::new(arr));
+                        } else {
+                            return None;
+                        }
+                    }
+                    _ => {
+                        let vec = &index.indexes;
+                        Some(Arc::new(Int32Array::from_iter(
+                            vec.iter().map(|x| x.$func().cloned()),
+                        )))
+                    }
+                }
             }
             Index::INT64(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int64Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int64 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        if let Ok(arr) = Decimal128Array::from_iter_values(
+                            vec.iter().map(|x| *x.$func().unwrap() as i128),

Review Comment:
   I wonder if it would be better to follow the model of below and rather than unwrapping turn it into 'NULL' 
   
   something like 🤔 
   
   ```suggestion
                               vec.iter().map(|x| *x.$func().ok()).map(|v| v as i128),
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1024787608


##########
datafusion/core/src/physical_plan/file_format/parquet.rs:
##########
@@ -654,6 +656,43 @@ pub async fn plan_to_parquet(
     }
 }
 
+// TODO: consolidate code with arrow-rs
+// Convert the bytes array to i128.
+// The endian of the input bytes array must be big-endian.
+// Copy from the arrow-rs
+pub(crate) fn from_bytes_to_i128(b: &[u8]) -> i128 {

Review Comment:
   Move the common func here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1026585150


##########
datafusion/core/tests/parquet/row_group_pruning.rs:
##########
@@ -503,465 +483,3 @@ async fn prune_decimal_in_list() {
     )
     .await;
 }
-

Review Comment:
   code move to mod.rs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
xudong963 commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1322873381

   I'll take a look the issue later :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028795409


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))

Review Comment:
   Your are right! i forgot change it writing the macro,   real surprise ut not cover this 😂



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028151323


##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   > I wonder if we have to run "type coercion / simplifiction" on them first?
   
   Did rowGroup run this "type coercion / simplifiction"  🤔 ? I think they are the same code path, i will find it out soon. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1024786913


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,50 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        if let Ok(arr) = Decimal128Array::from_iter_values(
+                            vec.iter().map(|x| *x.$func().unwrap() as i128),
+                        )
+                        .with_precision_and_scale(*precision, *scale)
+                        {
+                            return Some(Arc::new(arr));
+                        } else {
+                            return None;
+                        }
+                    }
+                    _ => {
+                        let vec = &index.indexes;
+                        Some(Arc::new(Int32Array::from_iter(
+                            vec.iter().map(|x| x.$func().cloned()),
+                        )))
+                    }
+                }
             }
             Index::INT64(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int64Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int64 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        if let Ok(arr) = Decimal128Array::from_iter_values(
+                            vec.iter().map(|x| *x.$func().unwrap() as i128),

Review Comment:
   https://github.com/apache/arrow-rs/blob/d88ed6a4eb72eb2f87544f2811c23fd55dc706be/parquet/src/file/statistics.rs#L434
   
   i think its ok `unwrap` here, same as  row group pruning



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1025943519


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -419,10 +468,37 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;

Review Comment:
   decimal should be supported for this logical type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028794578


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))
+                            .collect();
+                        if let Ok(arr) = Decimal128Array::from(vec)
+                            .with_precision_and_scale(*precision, *scale)
+                        {
+                            return Some(Arc::new(arr));
+                        } else {
+                            return None;
+                        }

Review Comment:
   Nice suggestion ! Some much api need remember deal with `option` and `result` 😂



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1025945769


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -419,10 +468,37 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;

Review Comment:
   Arrow-rs contains the method of decoding decimal from byte array in `ByteArrayReader`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1025941935


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -382,6 +394,9 @@ fn create_row_count_in_each_page(
 struct PagesPruningStatistics<'a> {
     col_page_indexes: &'a Index,
     col_offset_indexes: &'a Vec<PageLocation>,
+    // target_type means the logical type in schema: like 'DECIMAL' is the logical type, but the

Review Comment:
   👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028797172


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))

Review Comment:
   i will add a greater than test case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028151323


##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   > I wonder if we have to run "type coercion / simplifiction" on them first?
   Did rowGroup run this "type coercion / simplifiction"  🤔? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1025458794


##########
test-utils/src/data_gen.rs:
##########
@@ -146,6 +148,7 @@ impl BatchBuilder {
             .append_option(rng.gen_bool(0.9).then(|| rng.gen()));
         self.response_status
             .append_value(status[rng.gen_range(0..status.len())]);
+        self.prices_status.append_value(self.row_count as i128);

Review Comment:
   the incrementing price makes sense for range testing



##########
datafusion/core/tests/parquet/filter_pushdown.rs:
##########
@@ -266,20 +266,17 @@ async fn single_file_small_data_pages() {
     // page 3:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: djzdyiecnumrsrcbizwlqzdhnpoiqdh, max: fktdcgtmzvoedpwhfevcvvrtaurzgex, num_nulls not defined] CRC:[none] SZ:7 VC:9216
     // page 4:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fktdcgtmzvoedpwhfevcvvrtaurzgex, max: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, num_nulls not defined] CRC:[none] SZ:7 VC:9216
     // page 5:                                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: fwtdpgtxwqkkgtgvthhwycrvjiizdifyp, max: iadnalqpdzthpifrvewossmpqibgtsuin, num_nulls not defined] CRC:[none] SZ:7 VC:7739
-    //

Review Comment:
   🎉 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1026202590


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -419,10 +468,37 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;

Review Comment:
   Thanks, i prefer align with row group, do them together in other pr. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1027862911


##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))

Review Comment:
   I wonder if this this be `$x.$func()` rather than `x.min()`?



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))
+                            .collect();
+                        if let Ok(arr) = Decimal128Array::from(vec)
+                            .with_precision_and_scale(*precision, *scale)
+                        {
+                            return Some(Arc::new(arr));
+                        } else {
+                            return None;
+                        }
+                    }
+                    _ => {
+                        let vec = &index.indexes;
+                        Some(Arc::new(Int32Array::from_iter(
+                            vec.iter().map(|x| x.$func().cloned()),
+                        )))
+                    }
+                }
             }
             Index::INT64(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int64Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int64 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))

Review Comment:
   same question here -- should this be `x.$func()` rather than `x.min()`?



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(

Review Comment:
   This is great coverage -- thanks @Ted-Jiang. It is somewhat repetitive with the row group pruning but I think that is ok as they are different code paths



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr_subtract() {
+    // result of sql "SELECT * FROM t where 1-f > 1" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where 1-f > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_eq_in_list() {
+    // result of sql "SELECT * FROM t where in (1)"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i in (1)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq_in_list_negated() {
+    // result of sql "SELECT * FROM t where not in (1)" prune nothing
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i not in (1)",
+        Some(0),
+        Some(0),
+        19,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_decimal_lt() {
+    // The data type of decimal_col is decimal(9,2)
+    // There are three pages each 5 rows:
+    // [1.00, 6.00], [-5.00,6.00], [20.00,60.00]
+    test_prune(
+        Scenario::Decimal,
+        "SELECT * FROM t where decimal_col < 4",
+        Some(0),
+        Some(5),
+        6,
+    )
+    .await;
+    // compare with the casted decimal value
+    test_prune(
+        Scenario::Decimal,
+        "SELECT * FROM t where decimal_col < cast(4.55 as decimal(20,2))",
+        Some(0),
+        Some(5),
+        8,
+    )
+    .await;
+
+    // The data type of decimal_col is decimal(38,2)
+    test_prune(
+        Scenario::DecimalLargePrecision,
+        "SELECT * FROM t where decimal_col < 4",
+        Some(0),
+        Some(5),
+        6,
+    )
+    .await;
+    // compare with the casted decimal value
+    test_prune(
+        Scenario::DecimalLargePrecision,
+        "SELECT * FROM t where decimal_col < cast(4.55 as decimal(20,2))",
+        Some(0),
+        Some(5),
+        8,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_decimal_eq() {

Review Comment:
   it might be worth another test that prunes something other than 5 rows -- maybe `where decimal_col = 30.00` and prunes out the other pages? All of the tests here seem to prune out only the third page 20.00 -> 60.00



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   I wonder if we have to run "type coercion / simplifiction" on them first?



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -419,10 +468,37 @@ macro_rules! get_min_max_values_for_page_index {
                     vec.iter().map(|x| x.$func().cloned()),
                 )))
             }
-            Index::INT96(_) | Index::BYTE_ARRAY(_) | Index::FIXED_LEN_BYTE_ARRAY(_) => {
+            Index::BYTE_ARRAY(index) => {
+                let vec = &index.indexes;

Review Comment:
   Addition additional support in a follow on PR sounds like a good idea to me -- maybe we can file a ticket to track the work



##########
datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs:
##########
@@ -390,16 +405,54 @@ macro_rules! get_min_max_values_for_page_index {
         match $self.col_page_indexes {
             Index::NONE => None,
             Index::INT32(index) => {
-                let vec = &index.indexes;
-                Some(Arc::new(Int32Array::from_iter(
-                    vec.iter().map(|x| x.$func().cloned()),
-                )))
+                match $self.target_type {
+                    // int32 to decimal with the precision and scale
+                    Some(DataType::Decimal128(precision, scale)) => {
+                        let vec = &index.indexes;
+                        let vec: Vec<Option<i128>> = vec
+                            .iter()
+                            .map(|x| x.min().and_then(|x| Some(*x as i128)))
+                            .collect();
+                        if let Ok(arr) = Decimal128Array::from(vec)
+                            .with_precision_and_scale(*precision, *scale)
+                        {
+                            return Some(Arc::new(arr));
+                        } else {
+                            return None;
+                        }

Review Comment:
   You might be able to this more functionally with something like (untested):
   ```suggestion
                           Decimal128Array::from(vec)
                              .with_precision_and_scale(*precision, *scale)
                             .ok()
                             .map(|arr| Arc::new(arr))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1323569207

   Benchmark runs are scheduled for baseline = eac254c0ef347c5bd12e8f6401a6fb0113090294 and contender = d7a7fb61afe9ce2824aae737f65aec12d9513f7f. d7a7fb61afe9ce2824aae737f65aec12d9513f7f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/aefe89c4b88542168e2b7d5ee2c8de17...d5d385f70d85427c90ccee1e2a2b086f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f696aaf719544b14adb83850cc7f207a...7131e9cd1f224d5eb9294c3d18c49223/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/af7bd04f30dc4024994776dface9f4b0...476f9b0fbe434695b6fd0acf9163aa0f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/27359e2fca4c41678a351078d51b0761...2ad30cb622b849c1a6e343f0d34f4994/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1323563907

   I think this is ready to go in now -- thank you @xudong963  and @Ted-Jiang !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1319494081

   > 
   
   
   
   > Thank you @Ted-Jiang -- this looks great
   > 
   > I wonder if it would be possible to add some more targeted testing for the string and decimal page indexes in https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/parquet/page_pruning.rs
   > 
   > The current test in `parquet_exec` I think ensures that the plumbing is all hooked up correctly, but I think some more targeted testing would be good too
   > 
   > However, overall I think this PR could also go in as is. Thanks a lot!
   
   agree.
   
   @Ted-Jiang 
   We can add more test for this, and use different physical data type with different decimal data type(diff precision and scale are better)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1320187910

   @alamb  @liukun4515 Add all types check same as in `row_group`  for page-index pruning.
   
   [reorg test code](https://github.com/apache/arrow-datafusion/pull/4255/commits/5b6c4783d98fbe3f34f4cb4b305aece9ad1f3651):  code refactoring avoid duplicate code in test.
   [add test for page index](https://github.com/apache/arrow-datafusion/pull/4255/commits/4c81dca861935f66238e359d2e157a34910e0768): add same test case for page index.
   
   Some test are `ignore`, i think there are some bug with `complex_expr` 🤔 will fix in next pr(not have a clue now) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#issuecomment-1321116945

   I plan to review this carefully tomorrow again --sorry for the delay @Ted-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028828904


##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   related https://github.com/apache/arrow-datafusion/issues/4317



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   File related https://github.com/apache/arrow-datafusion/issues/4317



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #4255: Support parquet page filtering on min_max for `decimal128` and `string` columns

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #4255:
URL: https://github.com/apache/arrow-datafusion/pull/4255#discussion_r1028655581


##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -204,3 +222,466 @@ async fn page_index_filter_multi_col() {
     let batch = results.next().await.unwrap().unwrap();
     assert_eq!(batch.num_rows(), 7300);
 }
+
+async fn test_prune(
+    case_data_type: Scenario,
+    sql: &str,
+    expected_errors: Option<usize>,
+    expected_row_pages_pruned: Option<usize>,
+    expected_results: usize,
+) {
+    let output = ContextWithParquet::new(case_data_type, Page)
+        .await
+        .query(sql)
+        .await;
+
+    println!("{}", output.description());
+    assert_eq!(output.predicate_evaluation_errors(), expected_errors);
+    assert_eq!(output.row_pages_pruned(), expected_row_pages_pruned);
+    assert_eq!(
+        output.result_rows,
+        expected_results,
+        "{}",
+        output.description()
+    );
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000000             2020-01-02T01:01:01.000000000
+// page-1                         1  2020-01-01T01:01:11.000000000             2020-01-02T01:01:11.000000000
+// page-2                         1  2020-01-01T01:11:01.000000000             2020-01-02T01:11:01.000000000
+// page-3                         1  2020-01-11T01:01:01.000000000             2020-01-12T01:01:01.000000000
+async fn prune_timestamps_nanos() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where nanos < to_timestamp('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                         null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000000                2020-01-02T01:01:01.000000
+// page-1                         1  2020-01-01T01:01:11.000000                2020-01-02T01:01:11.000000
+// page-2                         1  2020-01-01T01:11:01.000000                2020-01-02T01:11:01.000000
+// page-3                         1  2020-01-11T01:01:01.000000                2020-01-12T01:01:01.000000
+async fn prune_timestamps_micros() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where micros < to_timestamp_micros('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01T01:01:01.000                   2020-01-02T01:01:01.000
+// page-1                         1  2020-01-01T01:01:11.000                   2020-01-02T01:01:11.000
+// page-2                         1  2020-01-01T01:11:01.000                   2020-01-02T01:11:01.000
+// page-3                         1  2020-01-11T01:01:01.000                   2020-01-12T01:01:01.000
+async fn prune_timestamps_millis() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where millis < to_timestamp_millis('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  1577840461                                1577926861
+// page-1                         1  1577840471                                1577926871
+// page-2                         1  1577841061                                1577927461
+// page-3                         1  1578704461                                1578790861
+
+async fn prune_timestamps_seconds() {
+    test_prune(
+        Scenario::Timestamps,
+        "SELECT * FROM t where seconds < to_timestamp_seconds('2020-01-02 01:01:11Z')",
+        Some(0),
+        Some(5),
+        10,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                       null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date32() {
+    test_prune(
+        Scenario::Dates,
+        "SELECT * FROM t where date32 < cast('2020-01-02' as date)",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         1  2020-01-01                                2020-01-04
+// page-1                         1  2020-01-11                                2020-01-14
+// page-2                         1  2020-10-27                                2020-10-30
+// page-3                         1  2029-11-09                                2029-11-12
+async fn prune_date64() {
+    // work around for not being able to cast Date32 to Date64 automatically
+    let date = "2020-01-02"
+        .parse::<chrono::NaiveDate>()
+        .unwrap()
+        .and_time(chrono::NaiveTime::from_hms_opt(0, 0, 0).unwrap());
+    let date = ScalarValue::Date64(Some(date.timestamp_millis()));
+
+    let output = ContextWithParquet::new(Scenario::Dates, Page)
+        .await
+        .query_with_expr(col("date64").lt(lit(date)))
+        .await;
+
+    println!("{}", output.description());
+    // This should prune out groups  without error
+    assert_eq!(output.predicate_evaluation_errors(), Some(0));
+    assert_eq!(output.row_pages_pruned(), Some(15));
+    assert_eq!(output.result_rows, 1, "{}", output.description());
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5                                        -1
+// page-1                         0  -4                                        0
+// page-2                         0  0                                         4
+// page-3                         0  5                                         9
+async fn prune_int32_lt() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    // result of sql "SELECT * FROM t where i < 1" is same as
+    // "SELECT * FROM t where -i > -1"
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where -i > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+async fn prune_int32_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun_and_eq() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1  and i = 1",
+        Some(0),
+        Some(15),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_scalar_fun() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where abs(i) = 1",
+        Some(0),
+        Some(0),
+        3,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where i+1 = 1",
+        Some(0),
+        Some(0),
+        2,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_int32_complex_expr_subtract() {
+    test_prune(
+        Scenario::Int32,
+        "SELECT * FROM t where 1-i > 1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+//                      null count  min                                       max
+// page-0                         0  -5.0                                      -1.0
+// page-1                         0  -4.0                                      0.0
+// page-2                         0  0.0                                       4.0
+// page-3                         0  5.0                                       9.0
+async fn prune_f64_lt() {
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f < 1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where -f > -1",
+        Some(0),
+        Some(5),
+        11,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun_and_gt() {
+    // result of sql "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1"
+    // only use "f >= 0" to prune
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f - 1) <= 0.000001  and f >= 0.1",
+        Some(0),
+        Some(2),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_scalar_fun() {
+    // result of sql "SELECT * FROM t where abs(f-1) <= 0.000001" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where abs(f-1) <= 0.000001",
+        Some(0),
+        Some(0),
+        1,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]
+async fn prune_f64_complex_expr() {
+    // result of sql "SELECT * FROM t where f+1 > 1.1"" is not supported
+    test_prune(
+        Scenario::Float64,
+        "SELECT * FROM t where f+1 > 1.1",
+        Some(0),
+        Some(0),
+        9,
+    )
+    .await;
+}
+
+#[tokio::test]
+#[ignore]

Review Comment:
   @Ted-Jiang Does the `test_prune` function not run the optimizer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org