You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/25 15:42:33 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

alamb opened a new pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619


   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/617 but I am still not sure if this is a bug or not (explained below)
   
    # Rationale for this change
   As explained on https://github.com/apache/arrow-datafusion/issues/617 the projection pushdown operation removes  columns from a scan when the `LogicalPlan::Projection` had unqualified columns (so an expression like `a`, rather than `table.a`).
   
   In my case in IOx this was occuring due to an extension node, when my code was creating the exprs without qualifiers. When I updated to the code to create the expression with qualifiers things are fine again
   
   ```diff
            let exprs = input
                .schema()
                .fields()
                .iter()
   -            .map(|field| logical_plan::col(field.name()))
   +            .map(|field| Expr::Column(field.qualified_column()))
                .collect::<Vec<_>>();
   ```
   
   Thus I am not sure if the projection pushdown code should actually handle this case, or if we should have some sort of error / warning instead. Having it silently produce the wrong answer (which is what happened in IOx) was hard to debug
   
   # What changes are included in this PR?
   1. Only check relational qualifier if there a relational qualifier on the expression, otherwise just check field name
   2. Ensure no duplicate field names by using BTreeSet
   
   # Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#issuecomment-868645653


   FYI @houqp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658910893



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       curious when should we use a HashSet v.s. BTreeSet?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658910893



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       curious when should we use a HashSet v.s. BTreeSet? EDIT: nvm, so you removed the sort afterwards :P




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r659159598



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       Yeah - exactly - I used the BTreeSet as we needed the ids sorted anywyas

##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       Yeah - exactly - I used the BTreeSet as we needed the ids sorted anyways




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan merged pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

Dandandan merged pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658910526



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns
         .iter()
-        .filter(|c| c.relation.as_ref() == table_name)
+        .filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name)

Review comment:
       good catch 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658910893



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       curious when should we use a HashSet v.s. BTreeSet? EDIT: nvm, saw you removed the sort afterwards :P




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658862205



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns

Review comment:
       When I did not use a `BTreeSet` one of the `UNION ALL` tests failed due to a duplicate column being projected. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #619: RFC: Do not prune out unnecessary columns with unqualified references

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #619:
URL: https://github.com/apache/arrow-datafusion/pull/619#discussion_r658863004



##########
File path: datafusion/src/optimizer/projection_push_down.rs
##########
@@ -75,9 +78,12 @@ fn get_projected_schema(
     //
     // we discard non-existing columns because some column names are not part of the schema,
     // e.g. when the column derives from an aggregation
-    let mut projection: Vec<usize> = required_columns
+    //
+    // Use BTreeSet to remove potential duplicates (e.g. union) as
+    // well as to sort the projection to ensure deterministic behavior
+    let mut projection: BTreeSet<usize> = required_columns
         .iter()
-        .filter(|c| c.relation.as_ref() == table_name)
+        .filter(|c| c.relation.is_none() || c.relation.as_ref() == table_name)

Review comment:
       This is the core change -- don't compare the relation qualifier if there is none -- otherwise if `c = Column { relation: None, name: "a"}` and the table name is `Some("foo")` the column will be filtered, even if `foo` has a column named `a`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org