Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/01/04 15:23:01 UTC

[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20153

    [SPARK-22392][SQL] data source v2 columnar batch reader

    ## What changes were proposed in this pull request?
    
    This PR adds a new Data Source V2 mix-in interface that allows a data source to return `ColumnarBatch`es during the scan.
    
    ## How was this patch tested?
    
    New tests, including a Java batch data source (`JavaBatchDataSourceV2`) that exercises the columnar scan path.
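
    For illustration only, here is a minimal Scala sketch of a reader that opts into the new mix-in, based on the `SupportsScanColumnarBatch` interface quoted in the review comments below. The class name, schema, and fallback condition are hypothetical; a real source would return actual `ReadTask[ColumnarBatch]` instances:

    ```scala
    import java.util.{Collections, List => JList}

    import org.apache.spark.sql.sources.v2.reader.{ReadTask, SupportsScanColumnarBatch}
    import org.apache.spark.sql.types.{IntegerType, StructType}
    import org.apache.spark.sql.vectorized.ColumnarBatch

    class ExampleColumnarReader extends SupportsScanColumnarBatch {

      override def readSchema(): StructType = new StructType().add("i", IntegerType)

      // A real implementation returns tasks that produce ColumnarBatches; the task
      // internals are source specific, so this sketch returns an empty list.
      override def createBatchReadTasks(): JList[ReadTask[ColumnarBatch]] =
        Collections.emptyList[ReadTask[ColumnarBatch]]()

      // The "safety door": return false (and also override createReadTasks) to fall
      // back to the row-based scan, e.g. when a column type is not supported.
      override def enableBatchRead(): Boolean =
        readSchema().fields.forall(_.dataType == IntegerType)
    }
    ```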

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark columnar-reader

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20153
    
----

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161265463
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -17,21 +17,24 @@
     
     package org.apache.spark.sql.execution
     
    -import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    +import org.apache.spark.sql.catalyst.expressions.{BoundReference, UnsafeRow}
     import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
     import org.apache.spark.sql.execution.metric.SQLMetrics
     import org.apache.spark.sql.types.DataType
     import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
     
     
     /**
    - * Helper trait for abstracting scan functionality using
    - * [[ColumnarBatch]]es.
    + * Helper trait for abstracting scan functionality using [[ColumnarBatch]]es.
      */
     private[sql] trait ColumnarBatchScan extends CodegenSupport {
     
       def vectorTypes: Option[Seq[String]] = None
     
    +  protected def supportsBatch: Boolean = true
    --- End diff --
    
    Add a comment to explain `supportsBatch`?
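
    For example, the kind of Scaladoc being asked for might read roughly like this (a suggestion only, not necessarily the wording that ended up in the patch):

    ```scala
    /**
     * Whether this scan produces [[ColumnarBatch]]es directly. When false, the generated
     * code reads the input one row at a time instead (see `produceRows`).
     */
    protected def supportsBatch: Boolean = true
    ```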


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86054/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86054/
    Test FAILed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    I will look at this tomorrow.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85683 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85683/testReport)** for PR 20153 at commit [`df89a83`](https://github.com/apache/spark/commit/df89a833fb3db3726f45e8a5982b0006f231fd98).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])`
      * `class DataSourceRDD[T: ClassTag](`


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85968/
    Test PASSed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    also cc @rxin


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86160/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86080/
    Test PASSed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86080/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Thanks! Merged to master/2.3


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85968/testReport)** for PR 20153 at commit [`b8a700d`](https://github.com/apache/spark/commit/b8a700d87d3708bae34054a00ad5d489280e5852).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160477447
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    `enableColumnarRead()` or `enableColumnarBatchRead()`?


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86160/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160746791
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    --- End diff --
    
    We need to explain the precedence of `SupportsScanColumnarBatch` and `SupportsScanUnsafeRow`.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161034623
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
    @@ -37,40 +35,58 @@ import org.apache.spark.sql.types.StructType
      */
     case class DataSourceV2ScanExec(
         fullOutput: Seq[AttributeReference],
    -    @transient reader: DataSourceV2Reader) extends LeafExecNode with DataSourceReaderHolder {
    +    @transient reader: DataSourceV2Reader)
    +  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {
     
       override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]
     
    -  override def references: AttributeSet = AttributeSet.empty
    +  override def producedAttributes: AttributeSet = AttributeSet(fullOutput)
     
    -  override lazy val metrics = Map(
    -    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
    +  private lazy val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
    +    case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
    +    case _ =>
    +      reader.createReadTasks().asScala.map {
    +        new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    +      }.asJava
    +  }
     
    -  override protected def doExecute(): RDD[InternalRow] = {
    -    val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
    -      case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
    -      case _ =>
    -        reader.createReadTasks().asScala.map {
    -          new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    -        }.asJava
    -    }
    +  private lazy val inputRDD: RDD[InternalRow] = reader match {
    +    case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    +      assert(!reader.isInstanceOf[ContinuousReader],
    +        "continuous stream reader does not support columnar read yet.")
    +      new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
    +
    +    case _ =>
    --- End diff --
    
    We can combine the nested case clauses with the outer match, like:
    ```
    reader match {
        case r: SupportsScanColumnarBatch if r.enableBatchRead() => ......
        case _: ContinuousReader => ......
        case _ => ......
    ```


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    retest this please


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86054/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).
     * This patch **fails from timeout after a configured wait of `300m`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85712/
    Test PASSed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    cc @gatorsmile @kiszk @jiangxb1987 @viirya 


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    retest this please


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161362308
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    I see. It would be good to clarify it in the comment.
    For example, would this be accurate: `A safety door for [[ColumnarBatch]] reader.`?


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85683/
    Test FAILed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    LGTM


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Actually it was not merged; let me merge it to master/2.3.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Is `ColumnarBatchScan` still an appropriate name? If `supportsBatch` is false, the trait handles a row-based scan, not a columnar or batch one.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)** for PR 20153 at commit [`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])`
      * `class DataSourceRDD[T: ClassTag](`


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85968/testReport)** for PR 20153 at commit [`b8a700d`](https://github.com/apache/spark/commit/b8a700d87d3708bae34054a00ad5d489280e5852).


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160472490
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -137,4 +147,25 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
          """.stripMargin
       }
     
    +  private def produceRows(ctx: CodegenContext, input: String): String = {
    +    val numOutputRows = metricTerm(ctx, "numOutputRows")
    +    val row = ctx.freshName("row")
    +
    +    ctx.INPUT_ROW = row
    +    ctx.currentVars = null
    +    // Always provide `outputVars`, so that the framework can help us build unsafe row if the input
    +    // row is not unsafe row, i.e. `needsUnsafeRowConversion` is true.
    +    val outputVars = output.zipWithIndex.map{ case (a, i) =>
    --- End diff --
    
    nit: `map {`


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160958116
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    Yeah, you can interpret it that way (reading data from columnar storage vs. row storage), but we can also interpret it as reading a batch of records at a time vs. one record at a time.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161362565
  
    --- Diff: sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaBatchDataSourceV2.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package test.org.apache.spark.sql.sources.v2;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    +import org.apache.spark.sql.sources.v2.DataSourceV2;
    +import org.apache.spark.sql.sources.v2.DataSourceV2Options;
    +import org.apache.spark.sql.sources.v2.ReadSupport;
    +import org.apache.spark.sql.sources.v2.reader.*;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.StructType;
    +import org.apache.spark.sql.vectorized.ColumnVector;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +
    +public class JavaBatchDataSourceV2 implements DataSourceV2, ReadSupport {
    +
    +  class Reader implements DataSourceV2Reader, SupportsScanColumnarBatch {
    --- End diff --
    
    This is the convention. If we implement many mix-in interfaces, it's better to write
    ```
    MyReader extends DataSourceV2Reader, XXX, YYY, ZZZ ...
    ```


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86080/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161362201
  
    --- Diff: sql/core/src/test/java/test/org/apache/spark/sql/sources/v2/JavaBatchDataSourceV2.java ---
    @@ -0,0 +1,112 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package test.org.apache.spark.sql.sources.v2;
    +
    +import java.io.IOException;
    +import java.util.List;
    +
    +import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    +import org.apache.spark.sql.sources.v2.DataSourceV2;
    +import org.apache.spark.sql.sources.v2.DataSourceV2Options;
    +import org.apache.spark.sql.sources.v2.ReadSupport;
    +import org.apache.spark.sql.sources.v2.reader.*;
    +import org.apache.spark.sql.types.DataTypes;
    +import org.apache.spark.sql.types.StructType;
    +import org.apache.spark.sql.vectorized.ColumnVector;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +
    +public class JavaBatchDataSourceV2 implements DataSourceV2, ReadSupport {
    +
    +  class Reader implements DataSourceV2Reader, SupportsScanColumnarBatch {
    --- End diff --
    
    Doesn't `SupportsScanColumnarBatch` already extend `DataSourceV2Reader`?


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160877601
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    --- End diff --
    
    `createReadTasks not supported by default within SupportsScanColumnarBatch.`, since we allow users to fall back to the normal read path.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160764573
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    If it controls batch mode vs. non-batch mode, I agree.

    IIUC, this value indicates whether we read data from column-oriented storage (e.g. `ColumnVector`) or row-oriented storage (e.g. `UnsafeRow`). I feel that it is not really a batch mode.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86072 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86072/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86072/
    Test PASSed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160880016
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
    @@ -37,40 +35,58 @@ import org.apache.spark.sql.types.StructType
      */
     case class DataSourceV2ScanExec(
         fullOutput: Seq[AttributeReference],
    -    @transient reader: DataSourceV2Reader) extends LeafExecNode with DataSourceReaderHolder {
    +    @transient reader: DataSourceV2Reader)
    +  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {
     
       override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]
     
    -  override def references: AttributeSet = AttributeSet.empty
    -
    -  override lazy val metrics = Map(
    -    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
    -
    -  override protected def doExecute(): RDD[InternalRow] = {
    -    val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
    -      case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
    -      case _ =>
    -        reader.createReadTasks().asScala.map {
    -          new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    -        }.asJava
    -    }
    +  override def producedAttributes: AttributeSet = AttributeSet(fullOutput)
    +
    +  private lazy val inputRDD: RDD[InternalRow] = reader match {
    +    case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    +      assert(!reader.isInstanceOf[ContinuousReader],
    +        "continuous stream reader does not support columnar read yet.")
    +      new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
    +
    +    case _ =>
    +      val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
    +        case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
    +        case _ =>
    +          reader.createReadTasks().asScala.map {
    +            new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    +          }.asJava
    +      }
    +
    +      reader match {
    --- End diff --
    
    This looks a bit messy. Can we move `readTasks` out as a lazy val? Then we may have:
    ```
    private lazy val readTasks = ......
    private lazy val inputRDD: RDD[InternalRow] = reader match {
        case r: SupportsScanColumnarBatch if r.enableBatchRead() => ......
        case _: ContinuousReader => ......
        case _ => ......
    }
    ```
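
    A rough sketch of that restructuring inside `DataSourceV2ScanExec`, reusing the task-creation code from the diff above (the `ContinuousReader` branch is left as a placeholder since its body is not shown in this thread, and wrapping `readTasks` in a `DataSourceRDD` for the row path is an assumption that mirrors the batch path):

    ```scala
    private lazy val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
      case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
      case _ =>
        reader.createReadTasks().asScala.map {
          new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
        }.asJava
    }

    private lazy val inputRDD: RDD[InternalRow] = reader match {
      case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
        new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
      case _: ContinuousReader =>
        ??? // continuous streaming read path, not shown in this thread
      case _ =>
        new DataSourceRDD(sparkContext, readTasks).asInstanceOf[RDD[InternalRow]]
    }
    ```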


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)** for PR 20153 at commit [`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161362205
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -90,14 +92,56 @@ case class InMemoryTableScanExec(
         columnarBatch
       }
     
    -  override def inputRDDs(): Seq[RDD[InternalRow]] = {
    -    assert(supportCodegen)
    +  private lazy val inputRDD: RDD[InternalRow] = {
         val buffers = filteredCachedBatches()
    -    // HACK ALERT: This is actually an RDD[ColumnarBatch].
    -    // We're taking advantage of Scala's type erasure here to pass these batches along.
    -    Seq(buffers.map(createAndDecompressColumn(_)).asInstanceOf[RDD[InternalRow]])
    +    if (supportsBatch) {
    +      // HACK ALERT: This is actually an RDD[ColumnarBatch].
    +      // We're taking advantage of Scala's type erasure here to pass these batches along.
    +      buffers.map(createAndDecompressColumn).asInstanceOf[RDD[InternalRow]]
    +    } else {
    +      val numOutputRows = longMetric("numOutputRows")
    +
    +      if (enableAccumulators) {
    --- End diff --
    
    +1


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r159678205
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -230,43 +274,10 @@ case class InMemoryTableScanExec(
       }
     
       protected override def doExecute(): RDD[InternalRow] = {
    -    val numOutputRows = longMetric("numOutputRows")
    --- End diff --
    
    This is moved to `inputRDD`


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86072 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86072/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86160/
    Test FAILed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161361844
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks not supported by default within SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    I feel it is hard to tell from the documentation whether this method is meant to enable batch reading, or to report whether this reader supports batch reading.
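
    For illustration, a minimal sketch of the fallback contract described above,
    written against the interfaces shown in this diff (the reader class and the
    type whitelist are assumptions, not part of this PR):

        import java.util.{List => JList}
        import scala.collection.JavaConverters._

        import org.apache.spark.sql.Row
        import org.apache.spark.sql.sources.v2.reader.{ReadTask, SupportsScanColumnarBatch}
        import org.apache.spark.sql.types._
        import org.apache.spark.sql.vectorized.ColumnarBatch

        // Hypothetical reader that can only vectorize a few primitive types.
        class ExampleReader(schema: StructType) extends SupportsScanColumnarBatch {
          private val vectorizable: Set[DataType] = Set(IntegerType, LongType, DoubleType)

          override def readSchema(): StructType = schema

          // Reports whether this particular scan can run in batch mode; the planner
          // checks it before deciding which create*ReadTasks method to call.
          override def enableBatchRead(): Boolean =
            schema.fields.forall(f => vectorizable.contains(f.dataType))

          // Batch path, used only when enableBatchRead() returns true. A real reader
          // would return one task per partition instead of an empty list.
          override def createBatchReadTasks(): JList[ReadTask[ColumnarBatch]] =
            Seq.empty[ReadTask[ColumnarBatch]].asJava

          // Row-based fallback, used only when enableBatchRead() returns false.
          override def createReadTasks(): JList[ReadTask[Row]] =
            Seq.empty[ReadTask[Row]].asJava
        }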


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r159678110
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -346,33 +348,6 @@ case class FileSourceScanExec(
     
       override val nodeNamePrefix: String = "File"
     
    -  override protected def doProduce(ctx: CodegenContext): String = {
    --- End diff --
    
    This is moved to `ColumnarBatchScan.produceRows`


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    retest this please


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160477594
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
    @@ -17,21 +17,24 @@
     
     package org.apache.spark.sql.execution
     
    -import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    +import org.apache.spark.sql.catalyst.expressions.{BoundReference, UnsafeRow}
     import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
     import org.apache.spark.sql.execution.metric.SQLMetrics
     import org.apache.spark.sql.types.DataType
     import org.apache.spark.sql.vectorized.{ColumnarBatch, ColumnVector}
     
     
     /**
    - * Helper trait for abstracting scan functionality using
    - * [[ColumnarBatch]]es.
    + * Helper trait for abstracting scan functionality using [[ColumnarBatch]]es.
      */
     private[sql] trait ColumnarBatchScan extends CodegenSupport {
     
       def vectorTypes: Option[Seq[String]] = None
     
    +  protected def supportsBatch: Boolean = true
    --- End diff --
    
    `supportColumnar()` or `supportColumnarBatch()`?


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86042 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86042/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #85683 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85683/testReport)** for PR 20153 at commit [`df89a83`](https://github.com/apache/spark/commit/df89a833fb3db3726f45e8a5982b0006f231fd98).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161252949
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -90,14 +92,56 @@ case class InMemoryTableScanExec(
         columnarBatch
       }
     
    -  override def inputRDDs(): Seq[RDD[InternalRow]] = {
    -    assert(supportCodegen)
    +  private lazy val inputRDD: RDD[InternalRow] = {
         val buffers = filteredCachedBatches()
    -    // HACK ALERT: This is actually an RDD[ColumnarBatch].
    -    // We're taking advantage of Scala's type erasure here to pass these batches along.
    -    Seq(buffers.map(createAndDecompressColumn(_)).asInstanceOf[RDD[InternalRow]])
    +    if (supportsBatch) {
    +      // HACK ALERT: This is actually an RDD[ColumnarBatch].
    +      // We're taking advantage of Scala's type erasure here to pass these batches along.
    +      buffers.map(createAndDecompressColumn).asInstanceOf[RDD[InternalRow]]
    +    } else {
    +      val numOutputRows = longMetric("numOutputRows")
    +
    +      if (enableAccumulators) {
    --- End diff --
    
    This conf is really confusing... Maybe rename it to `enableAccumulatorsForTestingOnly`?
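
    Side note on the HACK ALERT in the diff above: a standalone sketch (not code
    from this PR) of why that cast compiles and survives at runtime.

        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.catalyst.InternalRow
        import org.apache.spark.sql.vectorized.ColumnarBatch

        // JVM generics are erased, so an RDD[ColumnarBatch] can be smuggled through a
        // signature that expects RDD[InternalRow]; nothing fails until a consumer
        // treats an element as an InternalRow instead of casting it back to a batch.
        def smuggleBatches(batches: RDD[ColumnarBatch]): RDD[InternalRow] =
          batches.asInstanceOf[RDD[InternalRow]]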


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r160747322
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanColumnarBatch.java ---
    @@ -0,0 +1,51 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader;
    +
    +import java.util.List;
    +
    +import org.apache.spark.annotation.InterfaceStability;
    +import org.apache.spark.sql.Row;
    +import org.apache.spark.sql.vectorized.ColumnarBatch;
    +
    +/**
    + * A mix-in interface for {@link DataSourceV2Reader}. Data source readers can implement this
    + * interface to output {@link ColumnarBatch} and make the scan faster.
    + */
    +@InterfaceStability.Evolving
    +public interface SupportsScanColumnarBatch extends DataSourceV2Reader {
    +  @Override
    +  default List<ReadTask<Row>> createReadTasks() {
    +    throw new IllegalStateException(
    +      "createReadTasks should not be called with SupportsScanColumnarBatch.");
    +  }
    +
    +  /**
    +   * Similar to {@link DataSourceV2Reader#createReadTasks()}, but returns columnar data in batches.
    +   */
    +  List<ReadTask<ColumnarBatch>> createBatchReadTasks();
    +
    +  /**
    +   * A safety door for columnar batch reader. It's possible that the implementation can only support
    +   * some certain columns with certain types. Users can overwrite this method and
    +   * {@link #createReadTasks()} to fallback to normal read path under some conditions.
    +   */
    +  default boolean enableBatchRead() {
    --- End diff --
    
    This name is more general, and it looks fine to me. If we ever support another batch read mode, we can add an extra function to further identify which batch mode is in use.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86163/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86042/
    Test FAILed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86163/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86163/
    Test PASSed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r159678434
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
    @@ -37,40 +35,58 @@ import org.apache.spark.sql.types.StructType
      */
     case class DataSourceV2ScanExec(
         fullOutput: Seq[AttributeReference],
    -    @transient reader: DataSourceV2Reader) extends LeafExecNode with DataSourceReaderHolder {
    +    @transient reader: DataSourceV2Reader)
    +  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {
     
       override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]
     
    -  override def references: AttributeSet = AttributeSet.empty
    -
    -  override lazy val metrics = Map(
    -    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
    -
    -  override protected def doExecute(): RDD[InternalRow] = {
    -    val readTasks: java.util.List[ReadTask[UnsafeRow]] = reader match {
    -      case r: SupportsScanUnsafeRow => r.createUnsafeRowReadTasks()
    -      case _ =>
    -        reader.createReadTasks().asScala.map {
    -          new RowToUnsafeRowReadTask(_, reader.readSchema()): ReadTask[UnsafeRow]
    -        }.asJava
    -    }
    +  override def producedAttributes: AttributeSet = AttributeSet(fullOutput)
    +
    +  private lazy val inputRDD: RDD[InternalRow] = reader match {
    +    case r: SupportsScanColumnarBatch if r.enableBatchRead() =>
    +      assert(!reader.isInstanceOf[ContinuousReader],
    +        "continuous stream reader does not support columnar read yet.")
    +      new DataSourceRDD(sparkContext, r.createBatchReadTasks()).asInstanceOf[RDD[InternalRow]]
    --- End diff --
    
    cc @zsxwing can streaming technically support a columnar batch reader?


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    retest this please


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    I think `ColumnarBatchScan` is fine; `SupportsScanColumnarBatch` also has an `enableBatchRead` to fall back on.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    **[Test build #86042 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86042/testReport)** for PR 20153 at commit [`4a6a725`](https://github.com/apache/spark/commit/4a6a725acffdc24f7c00302c1a0081c93f6acdd8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20153
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20153


---



[GitHub] spark pull request #20153: [SPARK-22392][SQL] data source v2 columnar batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20153#discussion_r161257419
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala ---
    @@ -37,40 +35,58 @@ import org.apache.spark.sql.types.StructType
      */
     case class DataSourceV2ScanExec(
         fullOutput: Seq[AttributeReference],
    -    @transient reader: DataSourceV2Reader) extends LeafExecNode with DataSourceReaderHolder {
    +    @transient reader: DataSourceV2Reader)
    +  extends LeafExecNode with DataSourceReaderHolder with ColumnarBatchScan {
     
       override def canEqual(other: Any): Boolean = other.isInstanceOf[DataSourceV2ScanExec]
     
    -  override def references: AttributeSet = AttributeSet.empty
    +  override def producedAttributes: AttributeSet = AttributeSet(fullOutput)
    --- End diff --
    
    +1


---
