Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/25 10:16:22 UTC

[GitHub] [arrow-datafusion] crepererum opened a new issue, #4370: Rewrite simple regex expressions

crepererum opened a new issue, #4370:
URL: https://github.com/apache/arrow-datafusion/issues/4370

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   In InfluxDB IOx, some users query data with simple regex expressions that do not actually need a regex; I guess regexes are used for convenience or for technical reasons (e.g. auto-generated expressions). For "regex match" and "regex not match", we see the following cases:
   
   | Case     | Example           | Description | Logical Rewrite (for "match")  |
   | -------- | ----------------- | ----------- | ------------------------------ |
   | Empty    | `''`              | Match all   | `col IS NOT NULL`              |
   | OR-chain | `'foo\|bar\|baz'` | Any of      | `(col = 'foo') OR (col = 'bar') OR (col = 'baz')`<br><br>`col IN ('foo', 'bar', 'baz')` |
   
   Expressing these predicates as regexes instead of in their simple rewritten form has several performance consequences. First, these regex predicates are NOT considered for pruning (because how would you prune an arbitrary regex?):
   
   https://github.com/apache/arrow-datafusion/blob/e1204a5bf72c119123404463befb716adbdcff25/datafusion/core/src/physical_optimizer/pruning.rs#L818-L871
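
To illustrate why the rewritten form helps here, the following is a minimal sketch of min/max-based pruning for equality predicates. It assumes that pruning works from per-container minimum and maximum statistics; the `Stats` type and both function names are hypothetical and are not part of DataFusion's API.

```rust
/// Min/max statistics for one container (e.g. a row group).
/// Hypothetical type, used only for illustration.
struct Stats<'a> {
    min: &'a str,
    max: &'a str,
}

/// `col = lit` can only be true somewhere in the container
/// if `lit` lies within [min, max].
fn may_contain_eq(stats: &Stats, lit: &str) -> bool {
    stats.min <= lit && lit <= stats.max
}

/// `col IN (lits...)` can only be true if at least one literal
/// lies within [min, max]; otherwise the container can be skipped.
fn may_contain_any(stats: &Stats, lits: &[&str]) -> bool {
    lits.iter().any(|&lit| may_contain_eq(stats, lit))
}
```

An arbitrary regex offers no comparable range check, whereas a list of literals reduces to a handful of string comparisons against the statistics.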
   
   Finally, they are NOT pushed down into `ParquetExec`.
   
   **Describe the solution you'd like**
   Transform simple regex expressions into their equivalent logical expression.
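
A rough sketch of what such a transformation could look like for the "match" case, covering the two cases from the table above. This is only an illustration under stated assumptions: `SimpleRegex`, `classify` and `rewrite_match` are hypothetical names, not DataFusion APIs, and the output is SQL text, whereas a real optimizer rule would operate on logical expression nodes. Like the table, it treats each alternative as a comparison against the whole value; with an unanchored regex engine a real rewrite would also have to account for anchors.

```rust
/// Outcome of inspecting a regex pattern string.
#[derive(Debug, PartialEq)]
enum SimpleRegex {
    /// Empty pattern: matches every non-NULL value.
    Empty,
    /// Alternation of plain literals, e.g. `foo|bar|baz`.
    OrChain(Vec<String>),
    /// Anything else: keep the regex as-is.
    Other,
}

/// Characters treated as regex syntax (other than `|`).
/// Deliberately conservative: any of these disables the rewrite.
const META: &str = r"\.+*?()[]{}^$";

fn classify(pattern: &str) -> SimpleRegex {
    if pattern.is_empty() {
        return SimpleRegex::Empty;
    }
    if pattern.chars().any(|c| META.contains(c)) {
        return SimpleRegex::Other;
    }
    let parts: Vec<&str> = pattern.split('|').collect();
    // An empty alternative (e.g. `foo|`) is not a plain literal, so bail out.
    if parts.iter().any(|p| p.is_empty()) {
        return SimpleRegex::Other;
    }
    SimpleRegex::OrChain(parts.into_iter().map(String::from).collect())
}

/// Returns the rewritten predicate for `col ~ pattern` as SQL text,
/// or `None` if the pattern is not one of the simple cases.
fn rewrite_match(col: &str, pattern: &str) -> Option<String> {
    match classify(pattern) {
        SimpleRegex::Empty => Some(format!("{} IS NOT NULL", col)),
        SimpleRegex::OrChain(lits) => {
            let quoted: Vec<String> = lits
                .iter()
                // Escape single quotes for the SQL string literal.
                .map(|l| format!("'{}'", l.replace('\'', "''")))
                .collect();
            Some(format!("{} IN ({})", col, quoted.join(", ")))
        }
        SimpleRegex::Other => None,
    }
}
```

For example, `rewrite_match("col", "foo|bar|baz")` yields `col IN ('foo', 'bar', 'baz')`, while a pattern such as `fo+|bar` yields `None` and would stay a regex.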
   
   **Describe alternatives you've considered**
   Extend the pruning expression framework and `ParquetExec` to handle regexes. However, this seems unnecessarily complex and maybe even counterproductive, since regexes per se can be really expensive and complex to evaluate.
   
   **Additional context**
   \-
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #4370: Rewrite simple regex expressions

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #4370:
URL: https://github.com/apache/arrow-datafusion/issues/4370#issuecomment-1327359662

   This sounds good to me. Certain expressions could also potentially be rewritten into `LIKE` expressions.
   
   FWIW, I would rewrite `'foo|bar|baz'` to `col IN ('foo', 'bar', 'baz')`, as we already have an expression rewriter that can rewrite small `IN` lists into disjunctive expressions.
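
As a rough illustration of that second step, here is a sketch of expanding a small `IN` list into a disjunction. The function name and the `MAX_EXPANDED` cutoff are hypothetical; the existing rewriter's actual threshold and expression types are not shown in this thread.

```rust
/// Hypothetical cutoff above which the `IN` list is left untouched.
const MAX_EXPANDED: usize = 3;

/// Expands `col IN (v1, v2, ...)` into `(col = v1) OR (col = v2) OR ...`
/// as SQL text, or returns `None` if the list is empty or too long.
fn expand_in_list(col: &str, values: &[&str]) -> Option<String> {
    if values.is_empty() || values.len() > MAX_EXPANDED {
        return None;
    }
    let terms: Vec<String> = values
        .iter()
        // Escape single quotes for the SQL string literal.
        .map(|v| format!("({} = '{}')", col, v.replace('\'', "''")))
        .collect();
    Some(terms.join(" OR "))
}
```

Emitting `IN` from the regex rewrite and leaving the expansion to a separate step keeps the regex rule small and lets the size cutoff live in one place.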




[GitHub] [arrow-datafusion] alamb closed issue #4370: Rewrite simple regex expressions

Posted by GitBox <gi...@apache.org>.
alamb closed issue #4370: Rewrite simple regex expressions
URL: https://github.com/apache/arrow-datafusion/issues/4370

