You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2016/07/17 16:54:58 UTC

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/14240

    [SPARK-16594] [SQL] Remove Physical Plan Differences when Table Scan Having Duplicate Columns

    #### What changes were proposed in this pull request?
    Currently, we keep two implementations for planning scans over data sources. There is one difference between two implementation when deciding whether a `Project` is needed or not. 
    
    - Data Source Table Scan: https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L322-L395
    
      - `SELECT b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
    
    - Hive Table Scan and In-memory Table Scan:
    https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L71-L99 
    
      - `SELECT b, b FROM oneToTenPruned`: **_No `ProjectExec` is added._** 
    
    **Note.** When alias is being used, we will always add `ProjectExec` in all the scan types.
    
      - `SELECT b as alias_b, b FROM oneToTenPruned`: **_Add `ProjectExec`._**
    
    Because selecting the same column twice without adding `alias` is very weird, no clue why the code needs to behave differently here. This PR is to remove the differences. 
    
    #### How was this patch tested?
    Updated/added a test case

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark removeProject

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14240
    
----
commit ba7dcff8829c379666c417b87fc1e5022ed93141
Author: gatorsmile <ga...@gmail.com>
Date:   2016-07-17T07:26:47Z

    fix

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71545159
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    The following condition will be false if an alias is used: https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62433/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71471878
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    Yeah!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71643068
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    After this PR, Data Source Table Scan will return two columns too, just like what Hive Table Scan does. This is also shown in the the [test case](https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala#L117)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    The following condition will be false if an alias is used: https://github.com/gatorsmile/spark/blob/ba7dcff8829c379666c417b87fc1e5022ed93141/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71585784
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    me too. : ( I am also learning this part for refactoring the Hive table scan. That is why I found the behavior inconsistency between different types of table scans. Let me try to summarize it in a more organized way. 
    
    Currently, when converting logical plans to physical plans, we have two different strategies for the table scan (+ the adjacent `Filter` and `Project`, if any). 
    - Hive Table Scan and In-memory Table Scan.
    - Data Source Table Scan.
    
    Thus, the basic functionalities are the same. 
    
    The inputs include the filter predicates and the output project list and a leaf relation `LogicalRelation`, `MetastoreRelation`, and `InMemoryRelation`. The inputs are generated by the `PhysicalOperation` pattern, which extracts and normalizes Scan, Projects and Filters.
    
    The target is to generate a physical plan, like:
    ```
    ProjectExec 
    +- FilterExec
       +- TableScan
    ```
    
    `ProjectExec` is optional. When the filter predicates contain the unnecessary attributes or output project list has aliases, we have to add `ProjectExec`; otherwise, it is ok to exclude `ProjectExec`. The above logics is implemented by the following conditions in both `pruneFilterProject` functions:
    ```Scala
    AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
      filterSet.subsetOf(projectSet))
    ```
    The source codes:
    - [The original `pruneFilterProject` for Data Source Table Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366)
    - [The `pruneFilterProject` for Hive Table Scan and In-memory Table Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89
    )
    
    The above code shows that Data Source Table Scan has one extra condition:
    ```Scala
    projectSet.size == projects.size
    ``` 
    
    This condition is very specific for a rare case. Users select duplicate columns without using any alias name. Thus, to make them consistent, we are facing two options:
    - add this condition to both scan scenarios
    - remove this condition from both
    
    Either is fine to me, but I think we need to make them consistent. Let me know if my explanation is clear.
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71471414
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    so `SELECT b, b FROM oneToTenPruned` will return 2 columns and `SELECT b as alias_b, b FROM oneToTenPruned` only returns one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile closed the pull request at:

    https://github.com/apache/spark/pull/14240


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    is it actually a bug? `SELECT b, b FROM oneToTenPruned` may return one column while we expect 2 columns.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71562777
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    When we do not return two columns, we will always have a `ProjectExec` above the table scan:
    ```
    *Project [b#222 AS alias_b#240, b#222]
    +- InMemoryTableScan [b#222]
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71634679
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    after your PR, will data source table return only one column for this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71634358
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    ```SQL
    SELECT b, b FROM oneToTenPruned
    ```
    For Hive Table Scan, we exclude `ProjectExec` from the physical plan but [the Hive table scan returns two duplicate columns](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L93). No matter whether `projectSet.size == projects.size` is added or not, the result is always right. 
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    cc @cloud-fan Could you review this simple fix?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71548579
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    but this doesn't make sense, when user select 2 columns, how can we return only one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71565387
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    so what does the column number here means? Sorry I'm not familiar with this part of code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71631823
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    so for `SELECT b, b FROM oneToTenPruned` with hive table, we will exclude the `ProjectExec` and thus output only one column?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71938022
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    Found something interesting in File Source Scan. Will submit a separate PR for resolving it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71839551
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    `FileSourceStrategy` is worse. The order also matters!
    https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L148
    
    The output of file source scan is always following `readDataColumns ++ partitionColumns`.
    https://github.com/apache/spark/blob/37f3be5d29192db0a54f6c4699237b149bd0ecae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L140
    
    If the order is different from `readDataColumns ++ partitionColumns`, we will add a `ProjectExec`. This does not sound good to me.
    
    The next question is why `partitionColumns` is always part of the scan's output. Will try to find the answer tomorrow. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71644651
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    so we are assuming the `PrunedScan.buildScan` can handle duplicated columns? I feel it's dangerous to assume that as we didn't document this semantic, the external implementation may break it.
    
    For the hive one, it's internal stuff and we are safe to make this assumption. If we do wanna make them consistent, I'd like to change the hive one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71506928
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    I don't quite understand this, why they are different?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    **[Test build #62433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62433/consoleFull)** for PR 14240 at commit [`ba7dcff`](https://github.com/apache/spark/commit/ba7dcff8829c379666c417b87fc1e5022ed93141).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14240: [SPARK-16594] [SQL] Remove Physical Plan Differen...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14240#discussion_r71646312
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
    @@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
       testPruning("SELECT * FROM oneToTenPruned", "a", "b")
       testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
       testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
    -  testPruning("SELECT b, b FROM oneToTenPruned", "b")
    +  testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
    +  testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
    --- End diff --
    
    I like your point! Your concern is valid. We should be careful when making an assumption on the external data source implementation! : ) 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    @cloud-fan It is not a bug. I prefer to make them consistent. I did a few performance test and find it makes sense to return only one column, and then do the Filter, and then Project will generate two duplicate columns. This should be faster when Filter can remove most of rows. 
    
    However, this optimization condition `projectSet.size == projects.size` is very specific in this rare case: `SELECT b, b FROM oneToTenPruned`. It does not make sense to write such columns without specifying an alias. If using the alias, we will always return one column. This PR removed this condition, instead of adding the condition into the `Data Source Table Scan`.
    
    Let me know what is your opinion. Thanks!
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    **[Test build #62433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62433/consoleFull)** for PR 14240 at commit [`ba7dcff`](https://github.com/apache/spark/commit/ba7dcff8829c379666c417b87fc1e5022ed93141).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14240: [SPARK-16594] [SQL] Remove Physical Plan Differences whe...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14240
  
    Since File Scan is completely different from Data Source Table Scan, Hive Table Scan and In-memory Table Scan, it does not make sense to make all of them consistent. Close it now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org