You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "sarahyurick (via GitHub)" <gi...@apache.org> on 2023/03/13 23:21:57 UTC

[GitHub] [arrow-datafusion] sarahyurick opened a new issue, #5583: How to modify `OptimizerRule` fields?

sarahyurick opened a new issue, #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583

   Hi, I was wondering if it's currently possible to change or update `OptimizerRule` fields within the `try_optimize` function? Below, I've detailed the issues I'm running into and some things I've tried that didn't work.
   
   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I'm currently working on an optimizer rule for the Dask-SQL project, but I'm running into some annoying logic with DataFusion's optimizer. Basically, I'd like to be able to constantly update a field of the new rule's struct within the `try_optimize` function, but because it takes in a reference to the struct instead of the struct itself ([source code](https://github.com/apache/arrow-datafusion/blob/d0bd28eef34ca001d5e43d2732761a1b4cf09c71/datafusion/optimizer/src/optimizer.rs#L53)), the field can't be updated. To give a high-level example, let's say I have an optimizer rule:
   
   ```
   pub struct MyRule {
       my_set: HashSet<Expr>,
   }
   ```
   
   It would be nice to be able to use `ApplyOrder::TopDown` with it, but as I'm going down the `LogicalPlan` I want to insert any filters from any `Join` in the plan to `my_set`, so that this information can later be used to update a `TableScan`. However, within the `try_optimize` function, doing something like `self.my_set.insert(...)` isn't valid because it's a `&MyRule` and not a `MyRule`.
   
   **Describe the solution you'd like**
   I'm not sure if there's already a way to modify these fields when desired, but from my experimentation I couldn't get it to work. One solution I imagine is to change [`&self`](https://github.com/apache/arrow-datafusion/blob/d0bd28eef34ca001d5e43d2732761a1b4cf09c71/datafusion/optimizer/src/optimizer.rs#L61) to `self`, although I'm not sure if that's desirable.
   
   **Describe alternatives you've considered**
   I also tried within the `MyRule` implementation to create new instances of `MyRule` with logic like:
   
   ```
   fn try_optimize(
           &self,
           plan: &LogicalPlan,
           _config: &dyn OptimizerConfig,
       ) -> Result<Option<LogicalPlan>> {
           let my_expr = get_expr_func(&plan);
           let mut new_set = self.my_set.clone();
           new_set.insert(my_expr);
           let new_self = MyRule { my_set: new_set };
           // Some type of conditions, extra logic, etc. for when to recurse with try_optimize
           &new_self.try_optimize(plan, _config)
   }
   ```
   
   but was seeing some wonky logic where the updates would only carry to the next step and not through the rest of the steps down the plan. However, a similar solution could be possible.
   
   I also tried something like:
   
   ```
   impl MyRule {
       pub fn new() -> Self {
           Self { my_set: HashSet::new() }
       }
       
       pub fn set_myset(mut self, value: HashSet<Expr>) {
           self.my_set = value;
       }
   }
   
   impl OptimizerRule for MyRule {
       fn name(&self) -> &str {
           "my_rule"
       }
   
       fn apply_order(&self) -> Option<ApplyOrder> {
           Some(ApplyOrder::TopDown)
       }
   
       fn try_optimize(
           &self,
           plan: &LogicalPlan,
           _config: &dyn OptimizerConfig,
       ) -> Result<Option<LogicalPlan>> {
           let new_set = HashSet::from([...]);
           self.set_myset(new_set);
       }
   }
   ```
   
   which didn't work either.
   
   **Additional context**
   This isn't blocking any of my work on the Dask-SQL side; it's just that currently, the workaround I have to use requires a lot more lines of code than I would imagine should be needed.
   
   @andygrove @jdye64 I wanted to know your thoughts on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1469908435

   > I'm not familiar with the more recent changes in this space, but I would have expected the OptimizerRule to walk the tree itself to perform this rewrite, similar to how PhysicalOptimizerRule such as EnforceDistribution perform nested rewrites
   
   Yes, I think that is effectively what @jackwener did


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1469832233

   FYI @jackwener  I think recently did some work in this area for avoiding optimizer state -- they maybe have some ideas that can help


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sarahyurick commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "sarahyurick (via GitHub)" <gi...@apache.org>.
sarahyurick commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1470591148

   Ah okay, if it's preferred that the fields remain immutable, then I'll just have to do it a different way. Thanks for the responses!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sarahyurick commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "sarahyurick (via GitHub)" <gi...@apache.org>.
sarahyurick commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1467123805

   If it would be helpful, I can also provide a full Rust file example/create a Dask-SQL branch to further demonstrate the trouble I'm having. A simple example I've been working with is the case where we just have a `counter` numeric field and are trying to increment the counter by 1 between steps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1469910129

   Specifcally the tickets linked to https://github.com/apache/arrow-datafusion/issues/4267


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jdye64 commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "jdye64 (via GitHub)" <gi...@apache.org>.
jdye64 commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1468193771

   Not an expert on this but it seems like what we would really want here is a "mutable reference for self" aka `&mut self` for the `OptimizerRule::try_optimize` signature. I think using just a `self` in the signature would sort of invalidate the Rust paradigm of only using that signature when the intent is for the struct to be "consumed" and no longer present after the function has finished execution. Much like how "into_*" functions are meant to be used and why they have that `self` in the signature.
   
   Also changing that signature to a `&mut self` should cause less of an impact on existing optimizers both in DataFusion and that others might have in their codebase. Using this approach it would make sense to expose functions that "act upon" the `MyRule::my_set` field but don't necessarily update the entire field itself. For example using a function like `set_insert(&mut self, expr: &Expr)` instead of `set_myset(mut self, value: HashSet<Expr>)`. Something like this ....
   
   
   ```
   impl MyRule {
       pub fn new() -> Self {
           Self { my_set: HashSet::new() }
       }
       
       // Don't use this, left for reference.
       pub fn set_myset(mut self, value: HashSet<Expr>) {
           self.my_set = value;
       }
   
       // Instead using functions like this to access or modify the field in the mutable self reference.
       pub fn set_insert(&mut self, expr: &Expr) {
           self.my_set.insert(expr.clone());
       }
   }
   
   impl OptimizerRule for MyRule {
       fn name(&self) -> &str {
           "my_rule"
       }
   
       fn apply_order(&self) -> Option<ApplyOrder> {
           Some(ApplyOrder::TopDown)
       }
   
       // Notice the change from `&self` to `&mut self` here
       fn try_optimize(
           &mut self,
           plan: &LogicalPlan,
           _config: &dyn OptimizerConfig,
       ) -> Result<Option<LogicalPlan>> {
           // This way each `try_optimize` invocation will receive a reference to the same mutable self and can continuously "insert" elements into it
           self.set_insert(&col("test"));
       }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1469847486

   This might be a naive question, but shouldn't OptimizerRules be pure, i.e. side-effect free. I think taking `&mut self` not only runs the risk of making OptimizerRule sensitive but also lead to unbounded memory usage - the `OptimizerRule` are created once per `SessionContext`.
   
   I'm not familiar with the more recent changes in this space, but I would have expected the `OptimizerRule` to walk the tree itself to perform this rewrite, similar to how `PhysicalOptimizerRule` such as `EnforceDistribution` perform nested rewrites


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sarahyurick commented on issue #5583: How to modify `OptimizerRule` fields?

Posted by "sarahyurick (via GitHub)" <gi...@apache.org>.
sarahyurick commented on issue #5583:
URL: https://github.com/apache/arrow-datafusion/issues/5583#issuecomment-1468882836

   Thanks @jdye64 ! Using `&mut self` makes sense to me. I've started working toward a solution in https://github.com/apache/arrow-datafusion/pull/5605


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] sarahyurick closed issue #5583: How to modify `OptimizerRule` fields?

Posted by "sarahyurick (via GitHub)" <gi...@apache.org>.
sarahyurick closed issue #5583: How to modify `OptimizerRule` fields?
URL: https://github.com/apache/arrow-datafusion/issues/5583


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org