You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/08 10:30:15 UTC

[GitHub] [arrow-datafusion] jackwener opened a new pull request, #4141: fix: shouldn't pass alias through into subquery.

jackwener opened a new pull request, #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4123.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   This PR has fixed a bug(We can see bug in UT and test case, but none discover it).
   
   Original code passed the outside `Alias` into `Subquery`, it's wrong.
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#discussion_r1018486987


##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -746,7 +746,7 @@ async fn test_physical_plan_display_indent_multi_children() {
         "          ProjectionExec: expr=[c2@0 as c2]",
         "            ProjectionExec: expr=[c1@0 as c2]",
         "              RepartitionExec: partitioning=RoundRobinBatch(9000)",
-        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1]",
+        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1, c2]",

Review Comment:
   It is confusing the alias `c2` with the column `c2`, so I am concerned that this could also introduce correctness issues. It would be good to understand this better before we merge.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1313930570

   @jackwener Thanks for tracking down the bug and filing an issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1307973264

   I took a first pass through this, and it seems to make sense, but I was confused by some of the expected results. I'll take another look tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1314750164

   Thanks @jackwener 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
jackwener commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1311637198

   @alamb @andygrove 
   I have find the reason of `projection include [c1, c2]`.
   It's because the `pushdown project` rule don't handle `alias`. In fact, when handle `projection with alias`, we need replace the `alias column` in `new_required_columns`.
   
   For example:
   
   ```
   project c1 as c2
       tablescan c1.
   ```
   
   when we meet `project c1 as c2`, if `new_required_columns` include c2, we need replace it with c1.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
jackwener commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1312954630

   cc @mingmwang @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#discussion_r1016440748


##########
datafusion/sql/src/planner.rs:
##########
@@ -778,7 +778,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 let normalized_alias = alias.as_ref().map(|a| normalize_ident(&a.name));
                 let logical_plan = self.query_to_plan_with_alias(
                     *subquery,
-                    normalized_alias.clone(),
+                    None,

Review Comment:
   We shouldn't pass the `alias` into subquery inside.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#discussion_r1020160272


##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -746,7 +746,7 @@ async fn test_physical_plan_display_indent_multi_children() {
         "          ProjectionExec: expr=[c2@0 as c2]",
         "            ProjectionExec: expr=[c1@0 as c2]",
         "              RepartitionExec: partitioning=RoundRobinBatch(9000)",
-        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1]",
+        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1, c2]",

Review Comment:
   I think it should be caused by `project_pushdown rule`, I will look into it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#discussion_r1017146483


##########
datafusion/core/tests/sql/explain_analyze.rs:
##########
@@ -746,7 +746,7 @@ async fn test_physical_plan_display_indent_multi_children() {
         "          ProjectionExec: expr=[c2@0 as c2]",
         "            ProjectionExec: expr=[c1@0 as c2]",
         "              RepartitionExec: partitioning=RoundRobinBatch(9000)",
-        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1]",
+        "                CsvExec: files=[ARROW_TEST_DATA/csv/aggregate_test_100.csv], has_header=true, limit=None, projection=[c1, c2]",

Review Comment:
   This doesn't seem right to me.
   
   The query is only selecting `c1` from the table -- so why is this scan now selecting c1 *and* c2?
   
   ```
       let sql = "SELECT c1 \
                  FROM (select c1 from aggregate_test_100) AS a \
                  JOIN\
                  (select c1 as c2 from aggregate_test_100) AS b \
                  ON c1=c2\
                  ";
   ```



##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1521,8 +1521,8 @@ async fn reduce_left_join_3() -> Result<()> {
         "Explain [plan_type:Utf8, plan:Utf8]",
         "  Projection: t3.t1_id, t3.t1_name, t3.t1_int, t2.t2_id, t2.t2_name, t2.t2_int [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
         "    Left Join: t3.t1_int = t2.t2_int [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
-        "      Projection: t3.t1_id, t3.t1_name, t3.t1_int, alias=t3 [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "        Projection: t1.t1_id, t1.t1_name, t1.t1_int, alias=t3 [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",

Review Comment:
   it sure looks better to have not have this double alias



##########
benchmarks/expected-plans/q13.txt:
##########
@@ -2,8 +2,8 @@ Sort: custdist DESC NULLS FIRST, c_orders.c_count DESC NULLS FIRST
   Projection: c_orders.c_count, COUNT(UInt8(1)) AS custdist
     Aggregate: groupBy=[[c_orders.c_count]], aggr=[[COUNT(UInt8(1))]]
       Projection: c_orders.COUNT(orders.o_orderkey) AS c_count, alias=c_orders
-        Projection: c_orders.COUNT(orders.o_orderkey), alias=c_orders
-          Projection: COUNT(orders.o_orderkey), alias=c_orders
+        Projection: COUNT(orders.o_orderkey), alias=c_orders
+          Projection: COUNT(orders.o_orderkey)

Review Comment:
   This redundant projection seems unfortunate (though not caused by this PR)



##########
datafusion/sql/src/planner.rs:
##########
@@ -3170,10 +3170,10 @@ mod tests {
                      ) AS a
                    ) AS b";
         let expected = "Projection: b.fn2, b.last_name\
-                        \n  Projection: b.fn2, b.last_name, b.birth_date, alias=b\
-                        \n    Projection: a.fn1 AS fn2, a.last_name, a.birth_date, alias=b\

Review Comment:
   yeah looking at these old plans now this selecting a just to alias them as `b` is a little strange



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] liukun4515 merged pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
liukun4515 merged PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#issuecomment-1314753051

   Benchmark runs are scheduled for baseline = 9de893e54bba2c70782f4e684da70653b9241499 and contender = d91ce0684646f07edc8f182214f15845a9efab97. d91ce0684646f07edc8f182214f15845a9efab97 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/52b5fb3febe644a58f6deecac7e190a5...3ba75d1fe89940d888b1580fe3d6f382/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/ee27c96bf5d94a689c65926b587042ba...aeb867307f854d3cb1b143193f290cb4/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/00befacfa2f9482aa8ccc5ec541990ac...feb2a39c8e1745d4ba1214c6dd27bd2b/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/7a22d81cd07640a9a9cb10b701472f5f...2a36d3aee31941fba8841196e1ad1736/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #4141: fix: shouldn't pass alias through into subquery.

Posted by GitBox <gi...@apache.org>.
jackwener commented on code in PR #4141:
URL: https://github.com/apache/arrow-datafusion/pull/4141#discussion_r1016440748


##########
datafusion/sql/src/planner.rs:
##########
@@ -778,7 +778,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 let normalized_alias = alias.as_ref().map(|a| normalize_ident(&a.name));
                 let logical_plan = self.query_to_plan_with_alias(
                     *subquery,
-                    normalized_alias.clone(),
+                    None,

Review Comment:
   Fix bug here, because we shouldn't pass the `alias` into subquery inside.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org