Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/28 12:14:07 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #1693: Expression Simplification for `Expr::Case` expressions

alamb opened a new issue #1693:
URL: https://github.com/apache/arrow-datafusion/issues/1693


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   In certain situations, IOx is likely to generate predicates that look like the following:
   
   ```sql
   CASE 
     WHEN col IS NULL THEN '' 
     ELSE col 
   END
   ```
   These map `null` to the empty string.
   
   When applying them to certain specialized chunks (or row groups), we will know that `col` contains either no NULLs or only NULLs, and thus we will end up rewriting the expression to something like the following (shown here for the all-NULL case):
   
   ```sql
   CASE 
     WHEN true THEN '' 
     ELSE col 
   END
   ```
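As a rough illustration of the substitution step described above, the sketch below turns per-chunk null statistics into a literal that could stand in for `col IS NULL`. The names `NullStats`, `null_stats` and `is_null_as_literal` are hypothetical and are not part of IOx or DataFusion; the sketch also assumes the statistics are exact.

```rust
/// What chunk (or row group) statistics say about NULLs in one column.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NullStats {
    NoNulls,
    AllNulls,
    Unknown,
}

/// Summarize a null count against the row count of the chunk.
/// `None` means the statistic is missing.
fn null_stats(null_count: Option<u64>, row_count: u64) -> NullStats {
    match null_count {
        Some(0) => NullStats::NoNulls,
        Some(n) if n == row_count => NullStats::AllNulls,
        _ => NullStats::Unknown,
    }
}

/// The literal that can replace `col IS NULL` for this chunk, if any.
/// `None` means the predicate has to stay as it is.
fn is_null_as_literal(stats: NullStats) -> Option<bool> {
    match stats {
        NullStats::NoNulls => Some(false),
        NullStats::AllNulls => Some(true),
        NullStats::Unknown => None,
    }
}
```

With a chunk of 10 rows and a null count of 10, `is_null_as_literal(null_stats(Some(10), 10))` yields `Some(true)`, which corresponds to the `WHEN true` form above; a null count of 0 yields `Some(false)`.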
   
   More generally, when applying other simplifications / constant folding, I can imagine other situations where a CASE can be folded, such as:
   
   ```sql
   CASE
     WHEN extract(day from now()) = 0 THEN 'Monday'
     WHEN extract(day from now()) = 1 THEN 'Tuesday'
     WHEN extract(day from now()) = 2 THEN 'Wednesday'
     WHEN extract(day from now()) = 3 THEN 'Thursday'
     ...
     ELSE 'other day'
   END
   ```
   
   So I think it is worth adding to DataFusion in general.
   
   **Describe the solution you'd like**
   I would like to add cases to the rewrite rules here:
   https://github.com/apache/arrow-datafusion/blob/03075d5f4b3fdfd8f82144fcd409418832a4bf69/datafusion/src/optimizer/simplify_expressions.rs#L440-L454
   
   Some rules I can think of are:
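   1. When there is a literal `true` in the `cases` preceded only by literal `false`s or `null`s (zero or more of them), rewrite the `CASE` to that branch's `THEN` expression.
   2. When all the cases are literal `false`, rewrite the `CASE` to its `ELSE` (otherwise) expression.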
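The two rules above can be sketched on a toy expression type. This is only an illustration under stated assumptions: `Expr` and `simplify_case` below are hypothetical and are not DataFusion's `Expr` or its optimizer API. The sketch is also slightly more general than the rules as listed: it drops any branch whose condition is a literal `false` or `null`, wherever it appears, and treats the first literal `true` as the new `ELSE`.

```rust
/// Toy expression type, standing in for a real expression tree.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Null,
    Bool(bool),
    Str(&'static str),
    Column(&'static str),
    Case {
        when_then: Vec<(Expr, Expr)>,
        else_expr: Option<Box<Expr>>,
    },
}

/// Simplify a CASE whose conditions are (partly) literals.
/// Non-CASE expressions are returned unchanged.
fn simplify_case(expr: Expr) -> Expr {
    let (when_then, else_expr) = match expr {
        Expr::Case { when_then, else_expr } => (when_then, else_expr),
        other => return other,
    };
    let mut else_expr = else_expr;
    let mut kept = Vec::new();
    for (cond, then) in when_then {
        match cond {
            // A literal false or null condition never selects its branch.
            Expr::Bool(false) | Expr::Null => continue,
            // A literal true is always taken once reached, so it acts as
            // the ELSE and every later branch is unreachable.
            Expr::Bool(true) => {
                else_expr = Some(Box::new(then));
                break;
            }
            // Anything else has to be evaluated at run time.
            _ => kept.push((cond, then)),
        }
    }
    if kept.is_empty() {
        // No branch left: the CASE is just its ELSE (NULL when absent).
        return else_expr.map(|e| *e).unwrap_or(Expr::Null);
    }
    Expr::Case { when_then: kept, else_expr }
}
```

Applied to `CASE WHEN true THEN '' ELSE col END`, this returns the literal `''`; applied to `CASE WHEN false THEN '' ELSE col END`, it returns `col`.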
   
   **Additional context**
   See https://github.com/influxdata/influxdb_iox/pull/3557 for more details


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] liukun4515 commented on issue #1693: Expression Simplification for `Expr::Case` expressions

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on issue #1693:
URL: https://github.com/apache/arrow-datafusion/issues/1693#issuecomment-1024803959


   > When applying them to certain specialized chunks (or row groups) we will know there are no NULLs in the col or all NULLs and thus we will end up rewriting it to something like

   @alamb
   Do you mean that you can tell whether a column has no NULLs or all NULLs from the statistics of the chunk or Parquet row group?



[GitHub] [arrow-datafusion] alamb edited a comment on issue #1693: Expression Simplification for `Expr::Case` expressions

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #1693:
URL: https://github.com/apache/arrow-datafusion/issues/1693#issuecomment-1025116217


   > Do you mean that you can tell whether a column has no NULLs or all NULLs from the statistics of the chunk or Parquet row group?
   
   Yes.
   
   The use case is that in one of our supported query languages (InfluxQL), a missing column is treated as though it were `''` rather than `null`, and we add a CASE expression to do this mapping.
   
   However, the common case is that the column is either always present or always absent (i.e. it contains either no nulls or only nulls), in which case I would like to remove the mapping to avoid the overhead.



[GitHub] [arrow-datafusion] tustvold commented on issue #1693: Expression Simplification for `Expr::Case` expressions

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1693:
URL: https://github.com/apache/arrow-datafusion/issues/1693#issuecomment-1025890407


   I think the boolean transformation as written is subtly wrong. I think I have the correct one, but I'm just double-checking it.



[GitHub] [arrow-datafusion] alamb commented on issue #1693: Expression Simplification for `Expr::Case` expressions

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #1693:
URL: https://github.com/apache/arrow-datafusion/issues/1693#issuecomment-1025777506


   Another simplification, which @tustvold came up with, is:
   
   ```
   CASE 
     WHEN X THEN A
     WHEN Y THEN B 
     ...
     ELSE Q
   END
   ```
   
   which, assuming the types of `A`, `B`, ... and `Q` are all boolean, can be rewritten to
   
   ```
   (X AND A) OR (Y AND B) OR ... OR Q
   ```
   
   This can then be subjected to the existing AND / OR simplification rules.
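One way to sanity-check a boolean rewrite like the one above is to compare truth tables against the `CASE` semantics. The sketch below does that for a two-branch `CASE`, under the simplifying assumption that no value is NULL (SQL's three-valued logic is not modelled). All function names are hypothetical. The `guarded` form, which adds the negation of earlier conditions to each later term, is one candidate repair and not necessarily the version the commenters in this thread have in mind.

```rust
/// CASE WHEN x THEN a WHEN y THEN b ELSE q END, with no NULLs involved.
fn case_expr(x: bool, a: bool, y: bool, b: bool, q: bool) -> bool {
    if x {
        a
    } else if y {
        b
    } else {
        q
    }
}

/// The rewrite as written: (X AND A) OR (Y AND B) OR Q.
fn unguarded(x: bool, a: bool, y: bool, b: bool, q: bool) -> bool {
    (x && a) || (y && b) || q
}

/// Candidate repair: each term also requires earlier conditions to be false.
fn guarded(x: bool, a: bool, y: bool, b: bool, q: bool) -> bool {
    (x && a) || (!x && y && b) || (!x && !y && q)
}

/// Enumerate all 32 assignments of [x, a, y, b, q] and collect the ones
/// where `rewrite` disagrees with `case_expr`.
fn mismatches(rewrite: fn(bool, bool, bool, bool, bool) -> bool) -> Vec<[bool; 5]> {
    let mut out = Vec::new();
    for bits in 0..32u8 {
        let v = [
            bits & 1 != 0,
            bits & 2 != 0,
            bits & 4 != 0,
            bits & 8 != 0,
            bits & 16 != 0,
        ];
        let [x, a, y, b, q] = v;
        if rewrite(x, a, y, b, q) != case_expr(x, a, y, b, q) {
            out.push(v);
        }
    }
    out
}
```

For example, with `X = true`, `A = false` and `Q = true`, the `CASE` evaluates to `false` (the first branch is taken), while the unguarded form evaluates to `true` because of the trailing `OR Q`. Under the no-NULL assumption, `mismatches(guarded)` is empty; nullable conditions would need separate treatment.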



