You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/05 10:20:24 UTC

[GitHub] [arrow-datafusion] nevi-me opened a new pull request #508: add expr::like and expr::notlike to pruning logic

nevi-me opened a new pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508


   # Which issue does this PR close?
   
   Closes #507.
   
    # Rationale for this change
   
   Extending pruning to include string columns with `LIKE`
   
   # What changes are included in this PR?
   
   Checks if a `LIKE` and `NOT LIKE` condition don't start with `%`, and converts them into a `EQ` filter.
   
   # Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646248560



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));

Review comment:
       Thanks, fixed and changed tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r647307669



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,47 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    // Split the string to get the first part before '%'
+                    let split = string.split('%').next().unwrap().to_string();

Review comment:
       I initially wondered the same thing -- lol! But my conclusion was "no it won't panic"
   
   I made a quick  playground that shows this working https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=049f4c1640386ff99c3b5e07085e0889




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855220055


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#508](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1062d5c) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/a9d04ca570015be6c94b767fef079783efdd877c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a9d04ca) will **increase** coverage by `0.00%`.
   > The diff coverage is `80.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/508/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #508   +/-   ##
   =======================================
     Coverage   76.07%   76.08%           
   =======================================
     Files         155      155           
     Lines       26544    26593   +49     
   =======================================
   + Hits        20194    20233   +39     
   - Misses       6350     6360   +10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Ξ” | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.03% <80.00%> (-1.24%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a9d04ca...1062d5c](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855218249


   @alamb is it enough to add the `like` and `not like` where I added them? Not sure of where else I need to change.
   
   @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).
   
   If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use `like` and `not like` that can be pruned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r645975980



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       What about patterns like `pat1%pat2` ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855247835


   > However we don't support query 14/16/20 yet:
   
   Aw :( it would have been great to see what impact the small change has.
   
   Maybe @alamb will see better results in iOX given that their data would likely have patterns that could benefit from this pruning.
   
   There's also parquet column indices that were introduced in 2.5.0. I'd like to work on them, as that's where we'll see bigger read improvements on sorted data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646003565



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       you are right, that makes sense πŸ‘ , still needs to take care of escaping indeed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855260595


   @nevi-me 
   
   I did some checking with data from TPC-H:
   
   Before:
   
   ```
   CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
   select l_orderkey from T where l_comment like '1%';
   0 rows in set. Query took 150 milliseconds.
   > 
   select l_orderkey from T where l_comment like '{%';
   0 rows in set. Query took 143 milliseconds.
   ```
   
   After:
   ```
   CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
   select l_orderkey from T where l_comment like '1%';
   0 rows in set. Query took 148 milliseconds.
   > 
   select l_orderkey from T where l_comment like '{%';
   0 rows in set. Query took 43 milliseconds.
   ```
   
   So looks like πŸ‘


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-902895823


   Closing stale PRs to keep PR review list manageable. Please reopen if that is a mistake


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r647107983



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,47 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    // Split the string to get the first part before '%'
+                    let split = string.split('%').next().unwrap().to_string();

Review comment:
       won't this `unwrap` panic if the string does not contain any `%`? (if "like" always requires that, maybe we should throw an error instead?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646003565



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       you are right, that makes sense πŸ‘ 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r645991369



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       I think as we can evaluate the like expression anyway, it might be easier to support like / not like to the full extent instead of only "startswith".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r647122094



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,47 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    // Split the string to get the first part before '%'
+                    let split = string.split('%').next().unwrap().to_string();

Review comment:
       Like does not require `%`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855220055


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#508](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1062d5c) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/a9d04ca570015be6c94b767fef079783efdd877c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a9d04ca) will **increase** coverage by `0.00%`.
   > The diff coverage is `80.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/508/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #508   +/-   ##
   =======================================
     Coverage   76.07%   76.08%           
   =======================================
     Files         155      155           
     Lines       26544    26593   +49     
   =======================================
   + Hits        20194    20233   +39     
   - Misses       6350     6360   +10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Ξ” | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `91.03% <80.00%> (-1.24%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a9d04ca...1062d5c](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855260595


   @nevi-me 
   
   I did some checking with data from TPC-H:
   
   Before:
   
   ```
   CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
   select l_orderkey from T where l_comment like '1%';
   0 rows in set. Query took 150 milliseconds.
   > 
   select l_orderkey from T where l_comment like '{%';
   0 rows in set. Query took 143 milliseconds.
   ```
   
   ```
   CREATE EXTERNAL TABLE T STORED AS PARQUET LOCATION '../benchmarks/parquet/lineitem';
   select l_orderkey from T where l_comment like '1%';
   0 rows in set. Query took 148 milliseconds.
   > 
   select l_orderkey from T where l_comment like '{%';
   0 rows in set. Query took 43 milliseconds.
   ```
   
   So looks like πŸ‘


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646248751



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1095,6 +1133,60 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn row_group_predicate_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that is converted to a 'starts_with'
+        let expr = col("c1").like(lit("Banana%"));
+        let expected_expr =
+            "#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").like(lit("%Banana%"));
+        let expected_expr = "Boolean(true)";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").not().like(lit("Banana%"));

Review comment:
       This explains why the filter was negated, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r647316738



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,47 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    // Split the string to get the first part before '%'
+                    let split = string.split('%').next().unwrap().to_string();

Review comment:
       Yes, won't panic because String::split() will always return at least 1 result, the full string if there's nothing to spilt by




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r645996852



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       I only focused on expressions that don't start with `%`, under the assumption that they would be a `starts_with`. I don't think we can support anything other than a `starts_with` because we translate the queries to `min LtEq value && value LtEq max`.
   
   Or how would `LIKE '100\% %'` be evaluated?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646112932



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1095,6 +1133,60 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn row_group_predicate_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that is converted to a 'starts_with'
+        let expr = col("c1").like(lit("Banana%"));
+        let expected_expr =
+            "#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").like(lit("%Banana%"));
+        let expected_expr = "Boolean(true)";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'

Review comment:
       ```suggestion
           // test NOT LIKE operator that can't be converted to a 'starts_with'
   ```

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));

Review comment:
       I am not sure if just removing `%` is correct:
   
   For example in  a pattern like `foo%bar` would be converted to `foobar` and when compared with a value of `fooaaabar` would be deemed "out of range" by this logic, even though it matches the original predicate `foo%bar`.
   
   
   If instead, for `foo%bar` we used `foo` (only use the string up to the first unescaped `%`) I think then the logic applies.

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -548,7 +549,7 @@ fn build_predicate_expression(
         // allow partial failure in predicate expression generation
         // this can still produce a useful predicate when multiple conditions are joined using AND
         Err(_) => {
-            return Ok(logical_plan::lit(true));
+            return Ok(unhandled);

Review comment:
       πŸ‘ thanks I forgot that

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1095,6 +1133,60 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn row_group_predicate_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that is converted to a 'starts_with'
+        let expr = col("c1").like(lit("Banana%"));
+        let expected_expr =
+            "#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").like(lit("%Banana%"));
+        let expected_expr = "Boolean(true)";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").not().like(lit("Banana%"));

Review comment:
       I think there is a difference between `!(a LIKE b)` and `a NOT LIKE b` -- so to test the `NOT LIKE` operator above this should be something like
   
   ```suggestion
           let expr = col("c1").not_like(lit("Banana%");
   ```
   
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/logical_plan/expr.rs#L455-L457

##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -1095,6 +1133,60 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn row_group_predicate_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that is converted to a 'starts_with'
+        let expr = col("c1").like(lit("Banana%"));
+        let expected_expr =
+            "#c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").like(lit("%Banana%"));
+        let expected_expr = "Boolean(true)";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_not_starts_with() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'
+        let expr = col("c1").not().like(lit("Banana%"));
+        let expected_expr =
+            "NOT #c1_min LtEq Utf8(\"Banana\") And Utf8(\"Banana\") LtEq NOT #c1_max";
+        let predicate_expr =
+            build_predicate_expression(&expr, &schema, &mut RequiredStatColumns::new())?;
+        assert_eq!(format!("{:?}", predicate_expr), expected_expr);
+
+        Ok(())
+    }
+
+    #[test]
+    fn row_group_predicate_not_like() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("c1", DataType::Utf8, true)]);
+        // test LIKE operator that can't be converted to a 'starts_with'

Review comment:
       ```suggestion
           // test NOT LIKE operator that can't be converted to a 'starts_with'
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855245221


   > @alamb is it enough to add the `like` and `not like` where I added them? Not sure of where else I need to change.
   > 
   > @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).
   > 
   > If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use `like` and `not like` that can be pruned.
   
   Ah I see, it has some patterns that don't match any value at all... However we don't support query 14/16/20 yet:
   
   https://github.com/apache/arrow-datafusion/issues/165
   https://github.com/apache/arrow-datafusion/issues/167
   https://github.com/apache/arrow-datafusion/issues/171


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855220055


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#508](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1ee63dd) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/a9d04ca570015be6c94b767fef079783efdd877c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a9d04ca) will **increase** coverage by `0.06%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/508/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #508      +/-   ##
   ==========================================
   + Coverage   76.07%   76.14%   +0.06%     
   ==========================================
     Files         155      155              
     Lines       26544    26626      +82     
   ==========================================
   + Hits        20194    20274      +80     
   - Misses       6350     6352       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Ξ” | |
   |---|---|---|
   | [datafusion/src/physical\_optimizer/pruning.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfb3B0aW1pemVyL3BydW5pbmcucnM=) | `93.05% <100.00%> (+0.78%)` | :arrow_up: |
   | [datafusion/src/optimizer/constant\_folding.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvb3B0aW1pemVyL2NvbnN0YW50X2ZvbGRpbmcucnM=) | `91.31% <0.00%> (-0.38%)` | :arrow_down: |
   | [datafusion-cli/src/lib.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1jbGkvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ΓΈ)` | |
   | [datafusion-cli/src/main.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi1jbGkvc3JjL21haW4ucnM=) | `0.00% <0.00%> (ΓΈ)` | |
   | [datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/508/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvbG9naWNhbF9wbGFuL2V4cHIucnM=) | `84.96% <0.00%> (+0.36%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a9d04ca...1ee63dd](https://codecov.io/gh/apache/arrow-datafusion/pull/508?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855220573


   > @alamb is it enough to add the `like` and `not like` where I added them? Not sure of where else I need to change.
   > 
   > @Dandandan I'm unable to configure the TPC benchmark data (rather, converting the files to Parquet).
   > 
   > If you don't mind, may you please check if Q{14|16|20} perform any better with this change? They use `like` and `not like` that can be pruned.
   
   I can try!
   In my experience / to my knowlegde pruning only matters on sorted / bucketed / colocated data. The data generated by the TPC-H benchmark is very well distributed by default without doing some kind of sorting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r645988035



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       We should also consider escaped percent characters in the pattern. Example: `LIKE '100\% %'`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#discussion_r646003565



##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########
@@ -586,8 +587,45 @@ fn build_predicate_expression(
                 .min_column_expr()?
                 .lt_eq(expr_builder.scalar_expr().clone())
         }
+        Operator::Like => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>
+                {
+                    let scalar_expr =
+                        Expr::Literal(ScalarValue::Utf8(Some(string.replace('%', ""))));
+                    // Behaves like Eq
+                    let min_column_expr = expr_builder.min_column_expr()?;
+                    let max_column_expr = expr_builder.max_column_expr()?;
+                    min_column_expr
+                        .lt_eq(scalar_expr.clone())
+                        .and(scalar_expr.lt_eq(max_column_expr))
+                }
+                _ => unhandled,
+            }
+        }
+        Operator::NotLike => {
+            match &**right {
+                // If the literal is a 'starts_with'
+                Expr::Literal(ScalarValue::Utf8(Some(string)))
+                    if !string.starts_with('%') =>

Review comment:
       you are right, that makes sense πŸ‘  (I think escaping might be an issue in arrow-rs too)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-869141836


   Marking as draft so it is clearer from the list of PRs that there is planned work for this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-855257881


   Thanks @nevi-me  -- this looks great. I agree iOX may very well benefit from this as regex are common in our query workload. I will try and review this PR carefully tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] nevi-me commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-856652260


   > @nevi-me do you also want to address escaping the percentage character? `\%`
   > 
   > I know `like_utf8` is broken in Arrow but it might be confusing to introduce this error at different parts.
   > 
   > `\%` should just match the literal `%` character.
   > E.g. `nevi-\%x%` should use `nevi-%x` as start, not `nevi-\` as is the case currently.
   
   @alamb @jorgecarleitao please don't merge this yet, so I can address the above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #508: add expr::like and expr::notlike to pruning logic

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #508:
URL: https://github.com/apache/arrow-datafusion/pull/508#issuecomment-856459329


   @nevi-me do you also want to address escaping the percentage character? `\%` 
   
   I know `like_utf8` is broken in Arrow but it might be confusing to introduce this error at different parts.
   
   `\%`  should just match the literal `%` character.
   E.g. `nevi-\%x%` should use `nevi-%x` as start, not `nevi-\` as is the case currently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org