You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/11/04 09:56:38 UTC

[I] Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

alamb opened a new issue, #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051

   ### Is your feature request related to a problem or challenge?
   
   Currently, scalar UDF functions can not be "specialized"
   
   ```select
   SELECT * FROM t where my_matcher(t.column, '[a-z].*');
   ```
   
   What happens is that the regexp string `'[a-z].*'` gets passed as a literal expression to each invocation of the function, then you have to compile the regex again and again for every batch you process through the UDF. 
   
   Since it is expensive to compile these RegExps, it would be nice if there was something that could compile the RegExp once per plan rather than once per batch
   
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   Suggested by @thinkharderdev  on https://github.com/apache/arrow-datafusion/issues/8045#issuecomment-1793267211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1793400870

   One way to achieve this might be a [PhysicalOptimizerRule](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/optimizer/trait.PhysicalOptimizerRule.html#) that replaces relevant instances of [ScalarFunctionExpr](https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.ScalarFunctionExpr.html#). 
   
   However this is likely somewhat awkward to write and is not clear that these expressions would serialize well (as serialization matches a name to an expr). It is probably possible to do this during derialization with the [PhysicalExtensionCodec](https://docs.rs/datafusion-proto/latest/datafusion_proto/physical_plan/trait.PhysicalExtensionCodec.html#) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1979747257

   BTW now that @jayzhan211  and I  have implemented `ScalarUDF::simplify` in https://github.com/apache/arrow-datafusion/pull/9298 and we have ported the regular_expression functions to use `ScalarUDF`,  I think we could actually use that API to implement precompiled functions
   
   Note sure if that would meet your requirements @thinkharderdev 
   
   For example, to implement "precompiled regexp functions" we could do something like this (would be sweet if someone wanted to prototype this): 
   
   ```rust
   /// A new UDF that has a precompiled pattern
   impl PrecompiledRegexpReplace {
     precompiled_match: Arc<Pattern>
   }
   
   impl ScalarUDFImpl for PrecompiledRegexpReplace  {
      // invoke function uses `self.precompiled_match` directly
   ...
   }
   
   
   // Update the existing RegexpReplace function to implement `simplify`
   impl ScalarUDFImpl for RegexpReplace  {
   
     /// if the pattern argument is a scalar, rewrite the function to a new scalar UDF that
     /// contains a pre-compiled regular-expression
     fn simplify(&self) .. { 
       match (args[1], args[2]) {
          (ScalarValue::Utf8(pattern), ScalarValue::Utf8(flags)) => {
            let pattern = // create regexp match
            SImplified::Rewritten(ScalarUdf::new(PrecompiledRegexpMatch { precompiled } )))
             .call(args)
          }, 
         _ => Simplified::Original(args)
     }
   }
   ```
   
   We could then run some gnarly regular expression case, such as what is found on https://github.com/apache/arrow-datafusion/issues/8492 and see if it helps or not.
   
   If it doesn't help performance, then the extra complexity isn't worth it for regexp_replace
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "thinkharderdev (via GitHub)" <gi...@apache.org>.
thinkharderdev commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1981708580

   > Note sure if that would meet your requirements @thinkharderdev
   
   Yeah, although I was mainly using regexes as an example originally. Our particular issues are with some UDFs we have implemented that require some hilarious hacks to pass state as expressions. 
   
   But the idea does seem good to me.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1865126399

   BTW my hope it to prototype how this would work (as a ScalarUDF) by building on top of https://github.com/apache/arrow-datafusion/pull/8578
   
   It would be pretty rad


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "sadboy (via GitHub)" <gi...@apache.org>.
sadboy commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1863689720

   > the scalar function implementations that are passed around may be shared -- so using interior mutability would get confusing as each function invocation would be from the same instance of the function
   
   Hmm, not sure I'm following? I was thinking of the `ScalarFunctionExpr` physical node (https://docs.rs/datafusion-physical-expr/34.0.0/src/datafusion_physical_expr/scalar_function.rs.html#51), e.g. adding something like
   ```
   pub struct ScalarFunctionExpr {
   [...]
       args: Vec<Arc<dyn PhysicalExpr>>,
       prepared_args: OnceCell<Vec<Box<dyn PreparedArgs>>>,
   [...]
   }
   ```
   where `prepared_args` is populated by the return value of calling `ScalarFunction::prepare` on the set of physical `args`. The function implementation objects themselves do not need to mutate -- they only need to provide an optional implementation of `ScalarFunction::prepare` method if they wish to pre-process the arguments. `prepared_args` is purely an optimization construct, it should have no effect on the semantics of `ScalarFunctionExpr` -- the node should compute the exact same result whether or not `prepared_args` is populated. This way, any part of the system (e.g. `Serialize`, `Clone`, etc.) is always free to simply drop it without affecting correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1798636241

   @2010YOUY01  noted that in order to use the same APIs for built in functions and ScalarUDFs we will need to have some way to handle:
   
   > Figure out the interface to let core pass information to function definitions (e.g. now() requires to be passed query start time from core)
   
   This sounds very similar to this method
   
   Maybe we could add a `prepare(ctx: &TaskContext)` type method on ScalarUDF 🤔  that can potentially return a new version of the function to invoke 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1861674787

   > > That seems a bit heavy handed? Why not simply augment the scalar function expr node with an interior mutable "cache" cell, that is invisible to serialization? The cell can be automatically populated as needed by the function implementation upon node instantiation/deserialization.
   > 
   > @sadboy that is a (really) good idea
   
   BTW I thought more about this and one challenge is that currently the scalar function implementations that are passed around may be shared -- so using interior mutability would get confusing as each function invocation would be from the same instance of the function


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1857766870

   > That seems a bit heavy handed? Why not simply augment the scalar function expr node with an interior mutable "cache" cell, that is invisible to serialization? The cell can be automatically populated as needed by the function implementation upon node instantiation/deserialization.
   
   
   @sadboy  that is a (really) good idea
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1855953090

   I think the trick here will be to ensure we can still serialize such "precompiled" functions
   
   What I was thinking was maybe we can make a new `PhysicalExpr` that is like
   
   ```rust
   /// A function that is "precompiled" in some way 
   /// for example, for a regular expression that has a constant argument
   /// constant can be pre-compiled into a Regexpr match instance once per query
   /// rather than once per batch
   ///
   /// Somtimes precompiling make it hard/impossible to serialize the function again (e.g. the prepared regular expressions)
   /// so this structure contains the original PhysicalExpr that can be used to serialize the function
   struct PrecompiledExpr {
     precompiled: Arc<dyn PhysicalExpr>,
     original: Arc<dyn PhysicalExpr>
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Specialized / Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "sadboy (via GitHub)" <gi...@apache.org>.
sadboy commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1857273405

   That seems a bit heavy handed? Why not simply augment the scalar function expr node with an interior mutable "cache" cell, that is invisible to serialization? The cell can be automatically populated as needed by the function implementation upon node instantiation/deserialization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Pre-compiled / Prepared ScalarUDFs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8051:
URL: https://github.com/apache/arrow-datafusion/issues/8051#issuecomment-1793399666

   @thinkharderdev  suggests
   
   it would be useful to be able to serialize constant parameters into the user-defined scalar function themselves rather than pass them in as expressions. So for instance if you had to create a UDF to do something with a regex that you have as a static constant. Currently the way to do that is pass it as a literal expression. But then you have to compile the regex again for every batch you process through the UDF. Ideally you could have something like:
   > 
   > ```
   > struct MyRegexUdf {
   >   regex: Regex
   > }
   > 
   > impl ScalarFunction for MyRegexUdf {
   >   // use regex on each value somehow
   > }
   > ```
   > 
   > The regex would only need to be compiled once during deserialization (or construction) instead of once for each batch.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org