Posted to github@arrow.apache.org by "mustafasrepo (via GitHub)" <gi...@apache.org> on 2023/05/31 09:36:25 UTC

[GitHub] [arrow-datafusion] mustafasrepo opened a new pull request, #6501: Replace OrderedColumn with PhysicalSortExpr

mustafasrepo opened a new pull request, #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501

   # Which issue does this PR close?
   
   
   Closes #.
   
   # Rationale for this change
   The `OrderedColumn` struct stores a column together with its ordering information and is used during `OrderingEquivalence` calculations. However, the existing `PhysicalSortExpr` already tracks this information. `PhysicalSortExpr` also supports complex expressions, not just columns.
   
   We can use `PhysicalSortExpr` instead of `OrderedColumn`.
   
   
   # What changes are included in this PR?
   
   This PR removes the `OrderedColumn` struct and uses `PhysicalSortExpr` in its place.
   
   However, `PhysicalSortExpr` doesn't implement the `Hash` trait (and there is no trivial way to support it, if any), so we changed the `EquivalentClass` implementation so that it no longer requires `Hash`.
   Concretely, the places in `EquivalentClass` that used `HashSet` now use `Vec`.
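A minimal sketch of the idea, not the code in this PR: a set-like container backed by a `Vec` that only needs `PartialEq`, so it can hold types that do not implement `Hash`. The names `VecSet` and `SortKey` are hypothetical and used only for illustration; `SortKey` stands in for a sort expression.

```rust
/// Hypothetical stand-in for a sort expression: comparable, but not hashable.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    expr: String,
    descending: bool,
}

/// Illustrative set-like container backed by a `Vec`.
/// Every operation is a linear scan, which is the price of dropping `Hash`.
#[derive(Debug, Clone)]
struct VecSet<T> {
    items: Vec<T>,
}

impl<T: PartialEq> VecSet<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }

    /// Insert `item` if it is not already present; returns `true` if it was added.
    fn insert(&mut self, item: T) -> bool {
        if self.items.contains(&item) {
            false
        } else {
            self.items.push(item);
            true
        }
    }

    /// Returns `true` if an equal element is stored.
    fn contains(&self, item: &T) -> bool {
        self.items.contains(item)
    }

    /// Remove the element equal to `item`; returns `true` if something was removed.
    fn remove(&mut self, item: &T) -> bool {
        match self.items.iter().position(|x| x == item) {
            Some(idx) => {
                self.items.remove(idx);
                true
            }
            None => false,
        }
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

The trade-off is that lookups become O(n) instead of O(1) on average, which is usually acceptable when the collection is small.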
   
   # Are these changes tested?
   Yes. Existing tests should continue to pass, and a new test has been added (in the `window.slt` file) to show that complex expressions (not just columns) can be used during ordering equivalence calculations.
   
   # Are there any user-facing changes?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#discussion_r1214940147


##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -34,14 +32,14 @@ use std::sync::Arc;
 /// This is used to represent both:
 ///
 /// 1. Equality conditions (like `A=B`), when `T` = [`Column`]
-/// 2. Ordering (like `A ASC = B ASC`), when `T` = [`OrderedColumn`]
+/// 2. Ordering (like `A ASC = B ASC`), when `T` = [`PhysicalSortExpr`]
 #[derive(Debug, Clone)]
 pub struct EquivalenceProperties<T = Column> {

Review Comment:
   Yes, we can hopefully replace `Column` with `Arc<dyn PhysicalExpr>` in subsequent PRs.





[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#discussion_r1214931884


##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -115,6 +113,53 @@ impl<T: Eq + Hash + Clone> EquivalenceProperties<T> {
     }
 }
 
+/// Remove duplicates inside the `in_data` vector, returned vector would consist of unique entries
+fn deduplicate_vector<T: PartialEq>(in_data: Vec<T>) -> Vec<T> {
+    let mut result = vec![];
+    for elem in in_data {
+        if !result.contains(&elem) {
+            result.push(elem);
+        }
+    }
+    result
+}
+
+/// Find the position of `entry` inside `in_data`, if `entry` is not found return `None`.
+fn get_entry_position<T: PartialEq>(in_data: &[T], entry: &T) -> Option<usize> {
+    in_data.iter().position(|item| item.eq(entry))
+}
+
+/// Remove `entry` for the `in_data`, returns `true` if removal is successful (e.g `entry` is indeed in the `in_data`)
+/// Otherwise return `false`
+fn remove_from_vec<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> bool {
+    if let Some(idx) = get_entry_position(in_data, entry) {
+        in_data.remove(idx);
+        true
+    } else {
+        false
+    }
+}
+
+// Helper function to calculate column info recursively
+fn get_column_infos_helper(
+    indices: &mut Vec<(usize, String)>,
+    expr: &Arc<dyn PhysicalExpr>,
+) {
+    if let Some(col) = expr.as_any().downcast_ref::<Column>() {
+        indices.push((col.index(), col.name().to_string()))
+    } else if let Some(binary_expr) = expr.as_any().downcast_ref::<BinaryExpr>() {
+        get_column_infos_helper(indices, binary_expr.left());
+        get_column_infos_helper(indices, binary_expr.right());
+    };
+}
+
+/// Get index and name of each column that is in the expression (Can return multiple entries for `BinaryExpr`s)
+fn get_column_infos(expr: &Arc<dyn PhysicalExpr>) -> Vec<(usize, String)> {

Review Comment:
   I have changed its name





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#discussion_r1214775206


##########
datafusion/core/tests/sqllogictests/test_files/window.slt:
##########
@@ -2405,6 +2405,30 @@ GlobalLimitExec: skip=0, fetch=5
 ------SortExec: expr=[c9@0 DESC]
 --------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c9], has_header=true
 
+# This test shows that ordering equivalence can keep track of complex expressions (not just Column expressions)

Review Comment:
   ❤️ 



##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -34,14 +32,14 @@ use std::sync::Arc;
 /// This is used to represent both:
 ///
 /// 1. Equality conditions (like `A=B`), when `T` = [`Column`]
-/// 2. Ordering (like `A ASC = B ASC`), when `T` = [`OrderedColumn`]
+/// 2. Ordering (like `A ASC = B ASC`), when `T` = [`PhysicalSortExpr`]
 #[derive(Debug, Clone)]
 pub struct EquivalenceProperties<T = Column> {

Review Comment:
   🤔 Maybe eventually we will do the same thing for equality analysis as well.



##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -115,6 +113,53 @@ impl<T: Eq + Hash + Clone> EquivalenceProperties<T> {
     }
 }
 
+/// Remove duplicates inside the `in_data` vector, returned vector would consist of unique entries
+fn deduplicate_vector<T: PartialEq>(in_data: Vec<T>) -> Vec<T> {
+    let mut result = vec![];
+    for elem in in_data {
+        if !result.contains(&elem) {
+            result.push(elem);
+        }
+    }
+    result
+}
+
+/// Find the position of `entry` inside `in_data`, if `entry` is not found return `None`.
+fn get_entry_position<T: PartialEq>(in_data: &[T], entry: &T) -> Option<usize> {
+    in_data.iter().position(|item| item.eq(entry))
+}
+
+/// Remove `entry` for the `in_data`, returns `true` if removal is successful (e.g `entry` is indeed in the `in_data`)
+/// Otherwise return `false`
+fn remove_from_vec<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> bool {

Review Comment:
   Perhaps a more idiomatic way would be for this function to return `Option<T>` (which is what wrapping the removed element as `Some(in_data.remove(idx))` would give).
   
   That might allow you to avoid some of the other changes to `remove` later
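A sketch of what this suggestion could look like, under the assumption that the helper keeps the same arguments as `remove_from_vec`; the name `remove_entry` is hypothetical and not part of the PR.

```rust
/// Remove the first element equal to `entry` from `in_data` and return it,
/// or return `None` if no such element exists.
fn remove_entry<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> Option<T> {
    // `position` yields the index of the first match; `?` returns `None` early if absent.
    let idx = in_data.iter().position(|item| item == entry)?;
    // `Vec::remove` returns the removed element by value.
    Some(in_data.remove(idx))
}
```

A caller that only needs to know whether the removal happened can call `.is_some()` on the result.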



##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -551,4 +555,52 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]

Review Comment:
   👍 



##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -115,6 +113,53 @@ impl<T: Eq + Hash + Clone> EquivalenceProperties<T> {
     }
 }
 
+/// Remove duplicates inside the `in_data` vector, returned vector would consist of unique entries
+fn deduplicate_vector<T: PartialEq>(in_data: Vec<T>) -> Vec<T> {
+    let mut result = vec![];
+    for elem in in_data {
+        if !result.contains(&elem) {
+            result.push(elem);
+        }
+    }
+    result
+}
+
+/// Find the position of `entry` inside `in_data`, if `entry` is not found return `None`.
+fn get_entry_position<T: PartialEq>(in_data: &[T], entry: &T) -> Option<usize> {
+    in_data.iter().position(|item| item.eq(entry))
+}
+
+/// Remove `entry` for the `in_data`, returns `true` if removal is successful (e.g `entry` is indeed in the `in_data`)
+/// Otherwise return `false`
+fn remove_from_vec<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> bool {
+    if let Some(idx) = get_entry_position(in_data, entry) {
+        in_data.remove(idx);
+        true
+    } else {
+        false
+    }
+}
+
+// Helper function to calculate column info recursively
+fn get_column_infos_helper(
+    indices: &mut Vec<(usize, String)>,
+    expr: &Arc<dyn PhysicalExpr>,
+) {
+    if let Some(col) = expr.as_any().downcast_ref::<Column>() {
+        indices.push((col.index(), col.name().to_string()))
+    } else if let Some(binary_expr) = expr.as_any().downcast_ref::<BinaryExpr>() {
+        get_column_infos_helper(indices, binary_expr.left());
+        get_column_infos_helper(indices, binary_expr.right());
+    };
+}
+
+/// Get index and name of each column that is in the expression (Can return multiple entries for `BinaryExpr`s)
+fn get_column_infos(expr: &Arc<dyn PhysicalExpr>) -> Vec<(usize, String)> {

Review Comment:
   Maybe `get_column_index` would better explain what this function is doing.



##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -198,58 +247,7 @@ impl<T: Eq + Hash + Clone> EquivalentClass<T> {
     }
 }
 
-/// This object represents a [`Column`] with a definite ordering, for

Review Comment:
   🎉 





[GitHub] [arrow-datafusion] mustafasrepo merged pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo merged PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501




[GitHub] [arrow-datafusion] mustafasrepo commented on pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#issuecomment-1574428990

   > I think this PR looks great and adds a neat feature -- thank you @mustafasrepo. cc @mingmwang in case you have any interest in reviewing this
   > 
   > > However, because PhysicalSortExpr doesn't implement Hash trait (there is no trivial way to support this trait if any). We changed the EquivalentClass implementation so that it doesn't require Hash trait anymore.
   > 
   > We hit something similar when trying to make `LogicalPlan` implement hash (because of the `LogicalPlan::Extension` variant that has an `Arc<dyn UserDefinedLogicalNode>`)
   > 
   > The solution we came up with was
   > 
   > https://docs.rs/datafusion-expr/25.0.0/datafusion_expr/logical_plan/trait.UserDefinedLogicalNode.html#tymethod.dyn_hash
   > 
   > And then implemented it like this: https://docs.rs/datafusion-expr/25.0.0/src/datafusion_expr/logical_plan/extension.rs.html#235-285
   
   I will experiment with using `dyn_hash`; I think this will simplify the structure.
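For readers unfamiliar with the pattern, here is a rough, self-contained sketch of how a `dyn_hash`-style method can make trait objects usable in a `HashSet`. It is modelled loosely on the linked approach and is not DataFusion's actual code: the trait `DynExpr`, the type `Col`, and the helpers `col` and `count_distinct` are all hypothetical.

```rust
use std::any::Any;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical object-safe trait exposing hashing and equality through `dyn`.
trait DynExpr {
    fn as_any(&self) -> &dyn Any;
    fn dyn_hash(&self, state: &mut dyn Hasher);
    fn dyn_eq(&self, other: &dyn DynExpr) -> bool;
}

// Forward the standard traits on the trait object to the `dyn_*` methods.
impl Hash for dyn DynExpr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.dyn_hash(state);
    }
}

impl PartialEq for dyn DynExpr {
    fn eq(&self, other: &Self) -> bool {
        self.dyn_eq(other)
    }
}

impl Eq for dyn DynExpr {}

/// Hypothetical concrete expression type.
#[derive(Debug, PartialEq, Eq, Hash)]
struct Col {
    name: String,
    index: usize,
}

impl DynExpr for Col {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn dyn_hash(&self, mut state: &mut dyn Hasher) {
        // `&mut dyn Hasher` itself implements `Hasher`, so the derived `Hash` can feed it.
        self.hash(&mut state);
    }

    fn dyn_eq(&self, other: &dyn DynExpr) -> bool {
        // Equal only if `other` has the same concrete type and equal fields.
        other
            .as_any()
            .downcast_ref::<Self>()
            .map_or(false, |o| self == o)
    }
}

/// Convenience constructor returning a trait object.
fn col(name: &str, index: usize) -> Arc<dyn DynExpr> {
    Arc::new(Col { name: name.to_string(), index })
}

/// Count distinct expressions by collecting them into a `HashSet`.
fn count_distinct(exprs: Vec<Arc<dyn DynExpr>>) -> usize {
    exprs.into_iter().collect::<HashSet<_>>().len()
}
```

With something like this in place, a container of `Arc<dyn DynExpr>` could go back to using `HashSet` rather than a `Vec` with linear scans.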




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#discussion_r1218503788


##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -115,6 +113,53 @@ impl<T: Eq + Hash + Clone> EquivalenceProperties<T> {
     }
 }
 
+/// Remove duplicates inside the `in_data` vector, returned vector would consist of unique entries
+fn deduplicate_vector<T: PartialEq>(in_data: Vec<T>) -> Vec<T> {
+    let mut result = vec![];
+    for elem in in_data {
+        if !result.contains(&elem) {
+            result.push(elem);
+        }
+    }
+    result
+}
+
+/// Find the position of `entry` inside `in_data`, if `entry` is not found return `None`.
+fn get_entry_position<T: PartialEq>(in_data: &[T], entry: &T) -> Option<usize> {
+    in_data.iter().position(|item| item.eq(entry))
+}
+
+/// Remove `entry` for the `in_data`, returns `true` if removal is successful (e.g `entry` is indeed in the `in_data`)
+/// Otherwise return `false`
+fn remove_from_vec<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> bool {

Review Comment:
   makes sense -- thank you for the response





[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6501: Support ordering analysis with expressions (not just columns) by Replace `OrderedColumn` with `PhysicalSortExpr`

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6501:
URL: https://github.com/apache/arrow-datafusion/pull/6501#discussion_r1214945396


##########
datafusion/physical-expr/src/equivalence.rs:
##########
@@ -115,6 +113,53 @@ impl<T: Eq + Hash + Clone> EquivalenceProperties<T> {
     }
 }
 
+/// Remove duplicates inside the `in_data` vector, returned vector would consist of unique entries
+fn deduplicate_vector<T: PartialEq>(in_data: Vec<T>) -> Vec<T> {
+    let mut result = vec![];
+    for elem in in_data {
+        if !result.contains(&elem) {
+            result.push(elem);
+        }
+    }
+    result
+}
+
+/// Find the position of `entry` inside `in_data`, if `entry` is not found return `None`.
+fn get_entry_position<T: PartialEq>(in_data: &[T], entry: &T) -> Option<usize> {
+    in_data.iter().position(|item| item.eq(entry))
+}
+
+/// Remove `entry` for the `in_data`, returns `true` if removal is successful (e.g `entry` is indeed in the `in_data`)
+/// Otherwise return `false`
+fn remove_from_vec<T: PartialEq>(in_data: &mut Vec<T>, entry: &T) -> bool {

Review Comment:
   Since we remove by passing in the element itself, the caller already has the removed element. If we returned `Option<T>`, the value inside the `Option` would be equal to the `entry` argument. Hence this function is more akin to `HashSet::remove`. Also, inside the `remove` function we only care whether the removal was successful, so returning `Option<T>` would require introducing `is_some` checks there.
   Hence I think the current API is clearer. However, if it is misleading or counterintuitive, I can implement your suggestion.


