Posted to reviews@spark.apache.org by jiangxb1987 <gi...@git.apache.org> on 2018/01/23 08:23:07 UTC

[GitHub] spark pull request #20361: [SPARK-23188] [SQL] Make vectorized columar reade...

GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/20361

    [SPARK-23188] [SQL] Make vectorized columar reader batch size configurable

    ## What changes were proposed in this pull request?
    
    This PR includes the following changes:
    - Make the capacity of `VectorizedParquetRecordReader` configurable;
    - Make the capacity of `OrcColumnarBatchReader` configurable;
    - Update the error message when required capacity in writable columnar vector cannot be fulfilled.
    
    ## How was this patch tested?
    
    N/A
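
The third change in the description above concerns the error raised when a writable vector cannot grow to the required capacity. The following is a minimal sketch of that idea only, not Spark's implementation: `GrowableIntVector`, its `maxCapacity` limit, and the message wording are all hypothetical stand-ins chosen for illustration.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a writable column vector; this is not Spark's class. */
final class GrowableIntVector {
    private final int maxCapacity; // hypothetical hard limit on the number of elements
    private int[] data;

    GrowableIntVector(int initialCapacity, int maxCapacity) {
        this.data = new int[initialCapacity];
        this.maxCapacity = maxCapacity;
    }

    int capacity() {
        return data.length;
    }

    /** Grows the backing array so that it can hold at least {@code required} elements. */
    void reserve(int required) {
        if (required <= data.length) {
            return; // already large enough
        }
        if (required > maxCapacity) {
            // Illustrative wording: the message points the user at the batch size setting.
            throw new IllegalStateException(
                "Cannot reserve " + required + " elements (limit is " + maxCapacity
                    + "). Consider reducing the configured batch size.");
        }
        // Double the capacity when possible, but never exceed the limit.
        long doubled = 2L * data.length;
        int newCapacity = (int) Math.min((long) maxCapacity, Math.max((long) required, doubled));
        data = Arrays.copyOf(data, newCapacity);
    }
}
```

With a configurable batch size, a message of this kind gives the user something actionable: a smaller batch needs less contiguous memory per column.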

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark vectorCapacity

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20361
    
----
commit 927c6b4d16b5a4c6457a190f3c1b2b8a5e439f2a
Author: Xingbo Jiang <xi...@...>
Date:   2018-01-23T08:14:33Z

    make vector batch size configurable.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/156/
    Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/452/
    Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    thanks, merging to master!


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86522/testReport)** for PR 20361 at commit [`927c6b4`](https://github.com/apache/spark/commit/927c6b4d16b5a4c6457a190f3c1b2b8a5e439f2a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86547/
    Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86522/
    Test FAILed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    retest this please


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86547/testReport)** for PR 20361 at commit [`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86542/
    Test FAILed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    LGTM, pending jenkins


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164634309
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -377,6 +377,12 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
    --- End diff --
    
    I'd prefer `spark.sql.parquet.columnarReaderBatchSize`, which is clearer.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164634339
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -400,6 +406,12 @@ object SQLConf {
         .booleanConf
         .createWithDefault(true)
     
    +  val ORC_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.orc.batchSize")
    --- End diff --
    
    ditto


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86542/testReport)** for PR 20361 at commit [`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20361


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    This is not a bug fix, so it does not qualify for merging into Spark 2.3.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86542/testReport)** for PR 20361 at commit [`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r163918684
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -115,13 +116,15 @@
        */
       private final MemoryMode MEMORY_MODE;
     
    -  public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap) {
    +  public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap, int capacity) {
         this.convertTz = convertTz;
         MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
    +    this.capacity = capacity;
       }
     
    +  // Vectorized parquet reader used for testing and benchmark.
       public VectorizedParquetRecordReader(boolean useOffHeap) {
    -    this(null, useOffHeap);
    +    this(null, useOffHeap, 4096);
    --- End diff --
    
    How about changing the benchmark and test programs to pass the capacity, and removing this constructor?
    These programs also have access to `SQLConf`.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164634402
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
    @@ -49,8 +49,9 @@
      * After creating, `initialize` and `initBatch` should be called sequentially.
      */
     public class OrcColumnarBatchReader extends RecordReader<Void, ColumnarBatch> {
    -  // TODO: make this configurable.
    -  private static final int CAPACITY = 4 * 1024;
    +
    +  // The default size of vectorized batch.
    --- End diff --
    
    Maybe we can remove the comment. It's just the capacity, not a default value.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86901/testReport)** for PR 20361 at commit [`5ad935f`](https://github.com/apache/spark/commit/5ad935f28dae3d8879c0f65711f9b0861c993a24).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86901/
    Test PASSed.


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164634591
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -115,13 +116,15 @@
        */
       private final MemoryMode MEMORY_MODE;
     
    -  public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap) {
    +  public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap, int capacity) {
         this.convertTz = convertTz;
         MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
    +    this.capacity = capacity;
       }
     
    +  // Vectorized parquet reader used for testing and benchmark.
       public VectorizedParquetRecordReader(boolean useOffHeap) {
    -    this(null, useOffHeap);
    +    this(null, useOffHeap, 4096);
    --- End diff --
    
    It's good to avoid hardcoding the default value again in the code. If only a few places need to be changed, let's do it.
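
One way to picture the suggestion above is to keep the default in a single configuration object and have every caller, tests and benchmarks included, pass the capacity explicitly. The sketch below is an illustration under assumptions: `ReaderConf`, `SketchReader`, and the key `example.reader.batchSize` are hypothetical and are not Spark's `SQLConf` or `VectorizedParquetRecordReader`. Only the value 4096 comes from the diff.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical miniature config registry; this is not Spark's SQLConf. */
final class ReaderConf {
    static final String BATCH_SIZE_KEY = "example.reader.batchSize"; // hypothetical key
    static final int DEFAULT_BATCH_SIZE = 4096; // the only place the default is written down

    private final Map<String, String> settings = new HashMap<>();

    ReaderConf set(String key, String value) {
        settings.put(key, value);
        return this;
    }

    /** Returns the configured batch size, or the default when the key is unset. */
    int batchSize() {
        String raw = settings.get(BATCH_SIZE_KEY);
        return raw == null ? DEFAULT_BATCH_SIZE : Integer.parseInt(raw);
    }
}

/** Hypothetical reader with no convenience constructor that repeats the default. */
final class SketchReader {
    private final boolean useOffHeap;
    private final int capacity;

    SketchReader(boolean useOffHeap, int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive: " + capacity);
        }
        this.useOffHeap = useOffHeap;
        this.capacity = capacity;
    }

    boolean useOffHeap() {
        return useOffHeap;
    }

    int capacity() {
        return capacity;
    }
}
```

A test would then build its reader with something like `new SketchReader(false, conf.batchSize())`, so changing the default later touches one line.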


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/152/
    Test PASSed.


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r165234841
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java ---
    @@ -49,8 +49,9 @@
      * After creating, `initialize` and `initBatch` should be called sequentially.
      */
     public class OrcColumnarBatchReader extends RecordReader<Void, ColumnarBatch> {
    -  // TODO: make this configurable.
    -  private static final int CAPACITY = 4 * 1024;
    +
    +  // The default size of vectorized batch.
    --- End diff --
    
    How about rephrasing it to `The capacity of vectorized batch`?


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86522/testReport)** for PR 20361 at commit [`927c6b4`](https://github.com/apache/spark/commit/927c6b4d16b5a4c6457a190f3c1b2b8a5e439f2a).


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86547/testReport)** for PR 20361 at commit [`38debd7`](https://github.com/apache/spark/commit/38debd7957fc2376b92cac5ae6ad1b0b78fb33c2).


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/134/
    Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Hi, All.
    Can we have this in Spark 2.3, too?


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r165242969
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala ---
    @@ -40,7 +40,9 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex
             List.fill(n)(ROW).toDF.repartition(1).write.parquet(dir.getCanonicalPath)
             val file = SpecificParquetRecordReaderBase.listDirectory(dir).toArray.head
     
    -        val reader = new VectorizedParquetRecordReader(sqlContext.conf.offHeapColumnVectorEnabled)
    +        val conf = sqlContext.conf
    --- End diff --
    
    nit: `val capacity = sqlContext.conf.parquetVectorizedReaderBatchSize`


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164650445
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -377,6 +377,12 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
    --- End diff --
    
    Still a question: is it possible to use the estimated memory size instead of the number of rows?


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    **[Test build #86901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86901/testReport)** for PR 20361 at commit [`5ad935f`](https://github.com/apache/spark/commit/5ad935f28dae3d8879c0f65711f9b0861c993a24).


---



[GitHub] spark pull request #20361: [SPARK-23188][SQL] Make vectorized columar reader...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20361#discussion_r164685543
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -377,6 +377,12 @@ object SQLConf {
           .booleanConf
           .createWithDefault(true)
     
    +  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
    --- End diff --
    
    I'd say it's very hard. To satisfy a sizeInBytes limit, we would need to load the data record by record and stop loading once we hit the limit. But for performance reasons, we want to load the data in batches, which requires knowing the batch size ahead of time.
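
The trade-off described above can be sketched with two small planning functions. This is an illustration only, not how Spark's readers are written: `BatchPlanning` and both method names are hypothetical. With a row-count limit, batch boundaries are known before any row is read. With a byte limit, every row's size has to be inspected before deciding where a batch ends.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper contrasting the two batching strategies. */
final class BatchPlanning {

    /** Row-count batching: boundaries are computed without looking at any row. */
    static List<Integer> planByRowCount(int totalRows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
        }
        List<Integer> sizes = new ArrayList<>();
        for (int start = 0; start < totalRows; start += batchSize) {
            sizes.add(Math.min(batchSize, totalRows - start));
        }
        return sizes;
    }

    /** Byte-limit batching: each row's size must be examined one by one. */
    static List<Integer> planByByteLimit(int[] rowSizesInBytes, long maxBytes) {
        List<Integer> sizes = new ArrayList<>();
        int rowsInBatch = 0;
        long bytesInBatch = 0;
        for (int size : rowSizesInBytes) {
            // Close the current batch if adding this row would exceed the limit.
            // A single oversized row still gets a batch of its own.
            if (rowsInBatch > 0 && bytesInBatch + size > maxBytes) {
                sizes.add(rowsInBatch);
                rowsInBatch = 0;
                bytesInBatch = 0;
            }
            rowsInBatch++;
            bytesInBatch += size;
        }
        if (rowsInBatch > 0) {
            sizes.add(rowsInBatch);
        }
        return sizes;
    }
}
```

The first method needs only two integers, so buffers can be allocated once for `batchSize` rows and filled in bulk. The second depends on the data itself, which is what forces the record-by-record loading mentioned in the comment.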


---



[GitHub] spark issue #20361: [SPARK-23188][SQL] Make vectorized columar reader batch ...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20361
  
    cc @cloud-fan @sameeragarwal 


---
