Posted to reviews@spark.apache.org by xuanyuanking <gi...@git.apache.org> on 2018/08/24 15:59:04 UTC

[GitHub] spark pull request #22222: [SPARK-25083][SQL] Remove the type erasure hack i...

GitHub user xuanyuanking opened a pull request:

    https://github.com/apache/spark/pull/22222

    [SPARK-25083][SQL] Remove the type erasure hack in data source scan

    ## What changes were proposed in this pull request?
    
    1. Add the `inputBatchRDDs` and `inputRowRDDs` interface functions to `ColumnarBatchScan`.
    2. Override them in the physical nodes that extend `ColumnarBatchScan`.
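
As a rough illustration of the two accessors listed above, here is a minimal sketch. It is not the code from the patch: the signatures are assumptions, `Row` and `Batch` are stand-ins for Spark's `InternalRow` and `ColumnarBatch`, `Seq` stands in for `RDD`, and `ColumnarBatchScanSketch` and `RowOnlyScan` are hypothetical names.

```scala
// Stand-in types so the sketch compiles without Spark on the classpath.
final case class Row(values: Seq[Any])
final case class Batch(rows: Seq[Row])

// Hypothetical shape of the trait: one typed accessor per record type,
// instead of a single row-typed accessor that may actually hold batches.
trait ColumnarBatchScanSketch {
  def inputRowRDDs(): Seq[Seq[Row]]
  def inputBatchRDDs(): Seq[Seq[Batch]]
}

// A physical node that only produces rows overrides the row accessor
// and returns nothing from the batch accessor.
object RowOnlyScan extends ColumnarBatchScanSketch {
  def inputRowRDDs(): Seq[Seq[Row]] =
    Seq(Seq(Row(Seq(1, "a")), Row(Seq(2, "b"))))
  def inputBatchRDDs(): Seq[Seq[Batch]] = Seq.empty
}
```

With separate accessors, the element type of each input is visible in the signature, so a caller never has to guess what a row-typed RDD really contains.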
    
    ## How was this patch tested?
    
    This is refactoring work, tested with the existing unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuanyuanking/spark SPARK-25083

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22222
    
----
commit 992a08b1d77d59daeac95c67d07e5b8efe20ce20
Author: Yuanjian Li <xy...@...>
Date:   2018-08-24T15:54:27Z

    [SPARK-25083][SQL] Remove the type erasure hack in data source scan

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95261/
    Test PASSed.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95422/
    Test PASSed.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by xuanyuanking <gi...@git.apache.org>.

Github user xuanyuanking commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Got it. I'll revert the changes to the file source in this commit. Thanks for your reply.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by rdblue <gi...@git.apache.org>.

Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    @xuanyuanking, while this does remove the hack, it doesn't address the underlying problem. The problem is that there is a single RDD, which may contain InternalRow or may contain ColumnarBatch. Generated code knows how to differentiate between the two and use the RDD contents correctly.
    
    This is an improvement because it uses the actual type of the records in the RDD. The work that still needs to be done is to update the columnar case so that it returns an `RDD[InternalRow]` to anyone that accesses data through that RDD, and then to update the generated code to detect a data source RDD and access the underlying `RDD[ColumnarBatch]`.
    
    Here's some pseudo-code to demonstrate what I mean. The current code does something like this with a cast. Your change wouldn't fix the need to cast to `RDD[ColumnarBatch]`:
    ```scala
    def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = { // with your change, DataSourceRDD[_]
      if (rdd.isColumnar) {
        // Unchecked cast: the element type is erased, so the compiler cannot verify it.
        doExecuteColumnarBatch(rdd.asInstanceOf[RDD[ColumnarBatch]])
      } else {
        doExecuteRows(rdd)
      }
    }
    ```
    
    I think that should be changed to something like this, which is type-safe:
    ```scala
    def doExecute(rdd: DataSourceRDD[InternalRow]): Unit = {
      if (rdd.isColumnar) {
        // Typed accessor: returns the underlying RDD[ColumnarBatch] without a cast.
        doExecuteColumnarBatch(rdd.getColumnBatchRDD)
      } else {
        doExecuteRows(rdd)
      }
    }
    ```
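
The pseudo-code above can be fleshed out into a small self-contained sketch. Everything here is an assumption made for illustration: `Row` and `Batch` stand in for `InternalRow` and `ColumnarBatch`, `Seq` stands in for `RDD`, and `DataSourceRDDSketch` is a hypothetical class, not Spark's `DataSourceRDD`. It shows only the idea of a row-typed view plus a typed accessor for the underlying batches.

```scala
object TypeSafeScanSketch {
  // Stand-ins for Spark's InternalRow and ColumnarBatch; Seq stands in for RDD.
  final case class Row(value: Int)
  final case class Batch(rows: Seq[Row])

  // Hypothetical data source RDD: generic callers always see rows, and the
  // underlying batches are reachable through a typed accessor (no cast).
  final class DataSourceRDDSketch(batches: Option[Seq[Batch]], plainRows: Seq[Row]) {
    def isColumnar: Boolean = batches.isDefined

    // Row view: flattens batches so the element type is always Row.
    def rows: Seq[Row] = batches.map(_.flatMap(_.rows)).getOrElse(plainRows)

    // Typed access to the batches; fails loudly if the scan is not columnar.
    def getColumnBatchRDD: Seq[Batch] =
      batches.getOrElse(throw new IllegalStateException("not a columnar scan"))
  }

  // Mirrors the shape of doExecute above, returning a description
  // of the path taken instead of running a job.
  def doExecute(rdd: DataSourceRDDSketch): String =
    if (rdd.isColumnar) s"batches=${rdd.getColumnBatchRDD.size}"
    else s"rows=${rdd.rows.size}"
}
```

The point of the design is that `asInstanceOf` disappears from the caller: code that does not know about batches still gets a correctly typed row view, and code that does know asks for the batches explicitly.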


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    **[Test build #95422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95422/testReport)** for PR 22222 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    **[Test build #97865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97865/testReport)** for PR 22222 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    **[Test build #97845 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97845/testReport)** for PR 22222 at commit [`fdc1efc`](https://github.com/apache/spark/commit/fdc1efcdefe4b9bf002ce43ed1dfd7ab258218ca).


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    +1 on @rdblue's idea. One point: we should use `ColumnarBatchScan.supportsBatch` to indicate whether the scan is columnar, instead of asking the RDD to report it.
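
A minimal sketch of that variation, in which the plan node rather than the RDD reports whether the scan is columnar. The names `ScanNode`, `FixedScan` and `execute` are hypothetical, `Row` and `Batch` are stand-ins for `InternalRow` and `ColumnarBatch`, and `Seq` stands in for `RDD`. Only `supportsBatch` comes from the comment above.

```scala
object SupportsBatchSketch {
  final case class Row(value: Int)
  final case class Batch(rows: Seq[Row])

  // Hypothetical scan node: the flag lives on the node, next to the typed inputs.
  trait ScanNode {
    def supportsBatch: Boolean
    def rowInput: Seq[Row]
    def batchInput: Seq[Batch]
  }

  final case class FixedScan(
      supportsBatch: Boolean,
      rowInput: Seq[Row],
      batchInput: Seq[Batch]) extends ScanNode

  // Dispatches on the node's flag and returns the number of rows consumed.
  def execute(scan: ScanNode): Int =
    if (scan.supportsBatch) scan.batchInput.map(_.rows.size).sum
    else scan.rowInput.size
}
```

Keeping the flag on the node means the decision is made from the plan alone, so the RDD type does not need an extra method such as `isColumnar`.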


---

[GitHub] spark issue #22222: [SPARK-25083][SQL] Remove the type erasure hack in data ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22222
  
    **[Test build #95261 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95261/testReport)** for PR 22222 at commit [`7e88599`](https://github.com/apache/spark/commit/7e88599dfc2caf177d12e890d588be68bdd3bc8e).


---