You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/05 23:38:05 UTC

[GitHub] [arrow-datafusion] jonmmease opened a new pull request, #2463: Fix projection pushdown produces incorrect results when column names are reused

jonmmease opened a new pull request, #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463

   # Which issue does this PR close?
   
   Closes #2462.
   
   This PR adds the initially failing test `select_with_alias_overwrite`.  The simplest fix (that I've made in f3e87d8dde2ffb1d622fc30d53a1330b335dbc9b) is to revert #747.
   
   Let me know if you see a way to capture the optimization intended by #747 that works in the presence of overwritten column values.
   
   Thanks for taking a look!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#discussion_r870737808


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -172,16 +172,7 @@ fn optimize_plan(
                 _execution_props,
             )?;
 
-            let new_required_columns_optimized = new_input
-                .schema()
-                .fields()
-                .iter()
-                .map(|f| f.qualified_column())
-                .collect::<HashSet<Column>>();
-
-            if new_fields.is_empty()
-                || (has_projection && &new_required_columns_optimized == required_columns)

Review Comment:
   I wonder if the issue is that this code is just checking column names to detect a reorder. The real check, as exposed in your reproducer, is that the the column names are the same *AND* that the expressions are only column references or constants... In the case of expressions (like in your reproducer) equal names is not sufficient
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1123885836

   Thanks @alamb. Happy to make updates to tests and/or logic this evening as you have suggestions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1125039566

   @alamb I'm seeing a failure in "Run Ballista Tests" and I can't tell if they are related to this PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1126012040

   @alamb all green now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1122840778

   Thanks @jonmmease  -- I will look into this more carefully tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#discussion_r872667686


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -179,8 +180,12 @@ fn optimize_plan(
                 .map(|f| f.qualified_column())
                 .collect::<HashSet<Column>>();
 
+            let all_column_exprs = new_expr.iter().all(|e| matches!(e, Expr::Column(_)));

Review Comment:
   The only question I have here is "what if the column is in an `alias` or something. However since the tests pass 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1121651698

   @alamb I think the first question is whether we're OK with backing out #747.  I belive the tests that are failing are checking for this specific optimization.  I didn't want to remove those tests if what we actually need to do is change how the optimization is triggered (though it wasn't obvious to me how to do this without introducing the error described in https://github.com/apache/arrow-datafusion/issues/2462).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on a diff in pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on code in PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#discussion_r870822861


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -172,16 +172,7 @@ fn optimize_plan(
                 _execution_props,
             )?;
 
-            let new_required_columns_optimized = new_input
-                .schema()
-                .fields()
-                .iter()
-                .map(|f| f.qualified_column())
-                .collect::<HashSet<Column>>();
-
-            if new_fields.is_empty()
-                || (has_projection && &new_required_columns_optimized == required_columns)

Review Comment:
   That makes sense. I'll give it a try tomorrow unless you get to it first!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1119148681

   Unsurprisingly, currently the `redundunt_project` test fails.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1121498527

   The CI tests  in this PR appear to be failing -- @jonmmease are you looking into those failures? Or would you like me to give it a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1125422327

    > @alamb I'm seeing a failure in "Run Ballista Tests" and I can't tell if they are related to this PR
   
   I don't think they are related to this PR; The issue seems to be
   
   https://github.com/apache/arrow-datafusion/runs/6406466598?check_suite_focus=true
   
   ```
   nodefaultlibs"
     = note: /usr/bin/ld: final link failed: No space left on device
             collect2: error: ld returned 1 exit status
   ```
   
   I think @andygrove  was seeing something similar yesterday but it seems to have resolved itself. I will trigger the test again and see if it passes this time 🤷 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#issuecomment-1125973251

   Failed again, just tried rebasing on master :crossed_fingers: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jonmmease commented on a diff in pull request #2463: Fix projection pushdown produces incorrect results when column names are reused

Posted by GitBox <gi...@apache.org>.
jonmmease commented on code in PR #2463:
URL: https://github.com/apache/arrow-datafusion/pull/2463#discussion_r871425114


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -172,16 +172,7 @@ fn optimize_plan(
                 _execution_props,
             )?;
 
-            let new_required_columns_optimized = new_input
-                .schema()
-                .fields()
-                .iter()
-                .map(|f| f.qualified_column())
-                .collect::<HashSet<Column>>();
-
-            if new_fields.is_empty()
-                || (has_projection && &new_required_columns_optimized == required_columns)

Review Comment:
   Attempted in [3903566](https://github.com/apache/arrow-datafusion/pull/2463/commits/39035668f5242e504eb010cc15aada2709775305)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org