You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "peter-toth (via GitHub)" <gi...@apache.org> on 2023/12/27 14:08:59 UTC

[I] Get rid of special TreeNodes [arrow-datafusion]

peter-toth opened a new issue, #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663

   ### Is your feature request related to a problem or challenge?
   
   Curretnly there are many special `TreeNode` trees in DataFusion so as to carry additional information needed for tree transformations. These special trees are a bit cumbersome as they need to implement `TreeNode` functions (`apply_children()`, `map_children()`).
   
   
   ### Describe the solution you'd like
   
   I'm suggesting to add `TreeNode.transform_with_payload()`, `TreeNode.transform_down_with_payload()` and `TreeNode.transform_up_with_payload()` to propagate up additional information during `TreeNode` transformation.
   
   Please see 4. in https://github.com/apache/arrow-datafusion/pull/7942 and its POC implementation.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-2058979982

   > Is this ticket still tracking anything I wonder (or is it now covered by #8913 🤔 )
   
   I believe https://github.com/apache/arrow-datafusion/pull/8891 had covered this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1870418298

   I checked `SortPushDown` and `ExprOrdering` when opened https://github.com/apache/arrow-datafusion/pull/8664 on the top of this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-2059418294

   Thanks again @berkaysynnada  and @peter-toth  -- the TreeNode API is looking quite good these days


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #8663: Get rid of special TreeNodes
URL: https://github.com/apache/arrow-datafusion/issues/8663


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1871114495

   We have also observed the complex API and different implementation approaches of TreeNode. I'm not sure if this is the right place to share these but, let me share our findings and a simple POC of how we can have a new trait providing a simpler and more understandable API:
   
   Since the general approach in Datafusion tree structs involves having the children nodes within the self node, a pure `transform_down` cannot be performed. Even if the nodes are applied by the given `op()` in pre-order, the build order is always `bottom-up`. Inserting some logic into that build process (into `map_children()`) can introduce hidden `transform-up` functionality. For `transform-up()`s, the same flexibility may also cause the spread of a rule's logic into different parts of the code, going against code integrity.
   
   There are also some unused functionalities and outcome types in general, so I think that a more simple API is enough usually. I scribbled a little and came up with something like this, sharing it here in case it's useful.
   
   ```
   trait TreeNode_v2: Sized {
       fn children<'a>(&mut self) -> &'a mut [Self];
   
       fn transform<F>(&mut self, op_up: &mut F, op_down: &mut F) -> Result<()>
       where
           F: FnMut(&mut Self) -> Result<()>,
       {
           op_down(self)?;
   
           let children_as_mut_ref = self.children();
           if !children_as_mut_ref.is_empty() {
               children_as_mut_ref
                   .iter_mut()
                   .map(|c| c.transform(op_up, op_down))
                   .collect::<Result<_>>()?;
           }
   
           op_up(self)?;
           Ok(())
       }
   }
   ```
   
   `op_up` and `op_down` may also be separated into 2 different functions. But the key concept is that the user only must implement how the node reaches its children, and the corresponding rule logic to apply while traversing the tree. It is also important that the whole tree must be created before the traversals with default payloads. Before the refactor, there were some places making on-the-fly children construction during the traversal.
   
   I don’t know yet how this design can be integrated with the existing TreeNode, but in many places, such a design makes things easier and more understandable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1871218567

   I think think this is partly the same idea I was suggesting in 1. of https://github.com/apache/arrow-datafusion/pull/7942. The main point there was that the closures change from `FnMut(Self) -> Result<...>` to the self mutating `FnMut(&mut Self) -> Result<...>`.
   (Please note that the `TreeNodeTransformer` trait is not important in my PR. I just added there it to make `transform()` similar to the existing `rewrite()`. `rewrite()` uses a `TreeNodeRewriter` that has a pre-order (`pre_visit()`) and post-order like (`mutate()`) methods collected into one object, but I too would prefer separate `f_down` and `f_up` closures. Actually the new `transform_with_payload()` in https://github.com/apache/arrow-datafusion/pull/8664 follows that idea.)
   
   Where I see some differences is the return type of our closures. Your code snipett shows `Result<()>` while I was suggesting `Result<TreeNodeRecursion>`. This is because I wanted to deprecate the current `rewrite()` / `TreeNodeRewriter` and replace/combine that with a new `transform()` method. If we want to do that then the new closures need to be able to prune the tree traversal (skip a subtree) and I don't think we can do it with a simple `Result<()>`.
   (Actaully as the same pruning capability is needed for the `TreeNode.apply()` / `TreeNode.visit()` as well so I poposed to have a common `TreeNodeRecursion` enum for visit and transform functions. This is 2. in https://github.com/apache/arrow-datafusion/pull/7942)
   
   So basically here is the `transform()` I suggested:
   ```rust
   trait TreeNode_v2: Sized {
       fn transform_children<F>(&mut self, f: &mut F) -> Result<TreeNodeRecursion>
       where
           F: FnMut(&mut Self) -> Result<TreeNodeRecursion>;
   
       fn transform(&mut self, f_down: &mut F,f_up: &mut F) -> Result<TreeNodeRecursion>
       where 
           F: FnMut(&mut self) -> Result<TreeNodeRecursion>,
        {
           // Apply `f_down` on self.
           f_down(self)?
               // If it returns continue (not prune or stop or stop all) then continue
               // traversal on inner children and children.
               .and_then_on_continue(||
                   // Run the recursive `transform` on each children.
                   self
                       .transform_children(&mut |c| c.transform(f_down, f_up))?
                       // Apply `f_up` on new self.
                       .and_then_on_continue(|| f_up(self)))?
               // Applying `pre_transform` or `post_transform` on self might have returned
               // prune, but we need to propagate continue.
               .continue_on_prune()
       }
   }
   ```
   
   Regarding 
   
   > the key concept is that the user only must implement how the node reaches its children
   
   I totally aggree.
   But my question about `fn children<'a>(&mut self) -> &'a mut [Self];` is that can we return children as `&mut[Self]` without collecting them into a temporary collection like `Vec`? Currently `Expr.map_children()` doesn't collect the children into a temporary collection so if we start doing it then we might introduce some performance regression.
   BTW I was trying to follow a very similar approach initially. My initial `children()` returned an `impl Iterator<Item=&mut Self>` so as to avoid collecting childrens. But I simply couldn't implement it for `Expr` without returning an dynamic dispatch iterator so I remained at the above `transform_children()` that is similar to the current `map_children()`.
   
   Lastly, in this particulat issue I wanted to see if there is a way to get rid of those special trees. It seems unecessary to create those thees and I found it very hard to follow how additonal payload is propagated down/up. In https://github.com/apache/arrow-datafusion/pull/8664 I removed `SortPushDown` and `PlanWithRequitements` and `ExprOrdering` with the help of `transform_down_with_payload()` and `transform_up_with_payload()` functions and I think the new `f_down` / `f_up` closures are much simpler to understand now.
   But, I'm open for other proposals like @alamb's https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1870402672 although I'm not sure yet if capturing the state in the closures make things easier.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1871803326

   Thanks @peter-toth for your detailed explanation ☺️ 
   
   > Where I see some differences is the return type of our closures. Your code snipett shows `Result<()>` while I was suggesting `Result<TreeNodeRecursion>`. This is because I wanted to deprecate the current `rewrite()` / `TreeNodeRewriter` and replace/combine that with a new `transform()` method. If we want to do that then the new closures need to be able to prune the tree traversal (skip a subtree) and I don't think we can do it with a simple `Result<()>`. (Actaully as the same pruning capability is needed for the `TreeNode.apply()` / `TreeNode.visit()` as well so I poposed to have a common `TreeNodeRecursion` enum for visit and transform functions. This is 2. in #7942)
   
   I hadn't thought deeply about the return type, but your suggestion makes a lot of sense.
   
   > But my question about fn children<'a>(&mut self) -> &'a mut [Self]; is that can we return children as &mut[Self] without collecting them into a temporary collection like Vec? Currently Expr.map_children() doesn't collect the children into a temporary collection so if we start doing it then we might introduce some performance regression. 
   
   I wasn't aware of `Expr` specifically, but I considered this approach because collection is already performed in all rules on the physical plan side.
   
   I will review #8664 and write my new thoughts there. I believe we will find the most appropriate solution 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1870340696

   let me open a PR once https://github.com/apache/arrow-datafusion/pull/8653 landed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-1870402672

   > I'm suggesting to add TreeNode.transform_with_payload(), TreeNode.transform_down_with_payload() and TreeNode.transform_up_with_payload() to propagate up additional information during TreeNode transformation.
   
   Another common pattern in Rust is to capturing state in a closure rather than an explict payload -- perhaps that idea is also worth considering


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-2058973395

   Is this ticket still tracking anything I wonder (or is it now covered by #8913 🤔 )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Get rid of special TreeNodes [arrow-datafusion]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.
peter-toth commented on issue #8663:
URL: https://github.com/apache/arrow-datafusion/issues/8663#issuecomment-2058982084

   I think we can close this issue as https://github.com/apache/arrow-datafusion/pull/8817 reduced the number of `TreeNode` by introducing `PlanContext` and `ExprContext`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org