Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/14 17:55:47 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #340: StatefulFunctions

alamb opened a new issue #340:
URL: https://github.com/apache/arrow-datafusion/issues/340


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   On a PR that added `now()`, which Postgres would term a `stable` function (its value changes from transaction to transaction, but it is not a function of its inputs either), @jorgecarleitao suggested adding a `StatefulFunction` concept for functions that need state, unlike `ScalarFunction`, which is designed to be stateless.
   
   There is a lot of discussion on https://github.com/apache/arrow-datafusion/pull/288#issuecomment-839705580; I will try to summarize the main points here:
   
   @jorgecarleitao :
   
   > AFAIK current_* are all derived from now; imo the differentiator aspect here is that there is some state X that is being shared.
   >
   > It seems to me that the use-case here is that we want to preserve state across nodes, so that their execution depends on said state. NOW is an example, but in reality, random is also an example; we "cheated" a bit by not allowing users to select a seed. If they want that, we hit the same problem as NOW.
   >
   > IMO a natural construct here is something like struct StatefulFunction<T: Send + Sync>, where T is the state, and Arc<T> is inside of it, and that implements PhysicalExpr. During planning, the initial state is passed to it from the planner, and we are ready to fly.
   >
   > The ScalarFunction construct was meant to be stateless because it makes it very easy to develop, and it also makes it obvious that is stateless. Trying to couple execution state to them is imo going beyond its scope.
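
The construct described in the quote above could look roughly like the following. This is a minimal sketch, not DataFusion code: the `Expr` trait is a hypothetical one-method stand-in for `PhysicalExpr` (the real trait works on Arrow record batches), and `QueryStart`, `now_impl`, and `plan_two_now_nodes` are names invented for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `PhysicalExpr`, reduced to a
/// single method so that this sketch compiles on its own.
trait Expr {
    /// Evaluate the expression for a batch of `num_rows` rows.
    fn evaluate(&self, num_rows: usize) -> Vec<i64>;
}

/// A function whose result depends on shared state `T` that the planner
/// hands over once, before execution starts.
struct StatefulFunction<T: Send + Sync> {
    name: String,
    state: Arc<T>,
    fun: fn(&T, usize) -> Vec<i64>,
}

impl<T: Send + Sync> StatefulFunction<T> {
    fn new(name: &str, state: Arc<T>, fun: fn(&T, usize) -> Vec<i64>) -> Self {
        Self { name: name.to_string(), state, fun }
    }
}

impl<T: Send + Sync> Expr for StatefulFunction<T> {
    fn evaluate(&self, num_rows: usize) -> Vec<i64> {
        // The function body only ever sees the state captured at planning time.
        (self.fun)(&self.state, num_rows)
    }
}

/// Assumed shape of the state a planner would capture once per query.
struct QueryStart {
    timestamp_nanos: i64,
}

/// Body of `now()`: every row gets the timestamp captured at planning time.
fn now_impl(state: &QueryStart, num_rows: usize) -> Vec<i64> {
    vec![state.timestamp_nanos; num_rows]
}

/// Build two `now()` nodes that share a single planning-time state.
fn plan_two_now_nodes(
    timestamp_nanos: i64,
) -> (StatefulFunction<QueryStart>, StatefulFunction<QueryStart>) {
    let state = Arc::new(QueryStart { timestamp_nanos });
    (
        StatefulFunction::new("now", Arc::clone(&state), now_impl),
        StatefulFunction::new("now", state, now_impl),
    )
}
```

Because both nodes hold an `Arc` to the same state, they return the same value no matter when or on which thread each one is evaluated. A seeded `random` could, under the same assumptions, carry its seed in `T`.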
   
   @returnString 
   > In Postgres, this sort of corresponds to the function volatility categories (https://www.postgresql.org/docs/13/xfunc-volatility.html) which might be a useful basis for any future definition of different function types.
   >
   > - immutable: pure function, can only use arguments and internal constants (example: basic math ops). Optimiser can do lots here
   > - stable: can refer to shared state but must return the same value for the same arguments within a given statement (example: `now`). Optimiser is allowed to unify all references into one call per unique set of arguments
   > - volatile: no rules, no optimiser potential! Must always be evaluated exactly as initially planned (example: `random`)
   >
   > ...
   > Off the top of my head I think it'll open up some potential for generalised optimisation passes over function usage in queries according to function class, i.e. the optimiser rule used for the initial implementation of this PR but applicable to arbitrary functions provided they indicate themselves to be "stable".
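
To make the three categories concrete, here is a small sketch of how a volatility marker could drive the "one call per unique set of arguments" rule. The `Volatility` enum mirrors the Postgres category names; `Call`, `call`, and `evaluations_needed` are hypothetical names for illustration and do not correspond to any DataFusion API.

```rust
use std::collections::HashSet;

/// Function volatility classes, named after the Postgres categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

impl Volatility {
    /// Whether calls with identical arguments may be unified into a single
    /// call within one statement.
    fn can_unify_calls(self) -> bool {
        !matches!(self, Volatility::Volatile)
    }
}

/// A function call reduced to its name and rendered arguments
/// (hypothetical shape, just enough for this sketch).
#[derive(Debug, Clone)]
struct Call {
    name: String,
    args: Vec<String>,
    volatility: Volatility,
}

/// Convenience constructor for the examples.
fn call(name: &str, args: &[&str], volatility: Volatility) -> Call {
    Call {
        name: name.to_string(),
        args: args.iter().map(|a| a.to_string()).collect(),
        volatility,
    }
}

/// Count how many evaluations remain after unifying the calls that are
/// allowed to be unified. Volatile calls are always counted one by one.
fn evaluations_needed(calls: &[Call]) -> usize {
    let mut seen: HashSet<(&str, &[String])> = HashSet::new();
    let mut count = 0;
    for c in calls {
        if c.volatility.can_unify_calls() {
            // Only the first occurrence of (name, args) costs an evaluation.
            if seen.insert((c.name.as_str(), c.args.as_slice())) {
                count += 1;
            }
        } else {
            count += 1;
        }
    }
    count
}
```

With this rule, two `now()` calls collapse into one evaluation, two `random()` calls stay separate, and `abs(x)` and `abs(y)` remain distinct because their arguments differ.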
   
   cc @returnString @jorgecarleitao @msathis @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #340: StatefulFunctions

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #340:
URL: https://github.com/apache/arrow-datafusion/issues/340#issuecomment-841408125


   My personal take is that adding some way to mark a `ScalarFunction` as `immutable`, `stable`, or `volatile` would be valuable for query optimization: for example, we could inline/fold `immutable` functions in logical plans, inline/fold `stable` functions in physical plans, and never inline `volatile` functions.
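
The folding rule suggested in this comment can be sketched as a check over a toy expression tree. Everything here is an assumption made for illustration: `Expr`, `Stage`, and `can_fold` are hypothetical names, and the tree is far simpler than DataFusion's logical or physical expressions.

```rust
/// Function volatility classes, named after the Postgres categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

/// Planning stage at which a folding pass runs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Logical,
    Physical,
}

/// Toy expression tree (hypothetical, not DataFusion's `Expr`).
enum Expr {
    Literal(i64),
    Column(String),
    Call {
        name: String,
        volatility: Volatility,
        args: Vec<Expr>,
    },
}

/// Whether `expr` may be replaced by a constant at the given stage.
fn can_fold(expr: &Expr, stage: Stage) -> bool {
    match expr {
        Expr::Literal(_) => true,
        // A column differs per row, so it can never become a constant.
        Expr::Column(_) => false,
        Expr::Call { volatility, args, .. } => {
            let allowed = match volatility {
                // Pure functions can be folded as early as the logical plan.
                Volatility::Immutable => true,
                // Stable functions wait until the physical plan.
                Volatility::Stable => stage == Stage::Physical,
                // Volatile functions are never folded.
                Volatility::Volatile => false,
            };
            // Every argument must itself be foldable at this stage.
            allowed && args.iter().all(|a| can_fold(a, stage))
        }
    }
}
```

Under these assumptions, `abs(-1)` folds during logical planning, `now()` folds only once the physical plan is built, and `random()` or anything that reads a column is left alone.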
   
   

