You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ueshin <gi...@git.apache.org> on 2017/08/16 06:13:58 UTC

[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/18958

    [SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

    ## What changes were proposed in this pull request?
    
    This is a refactoring of `ColumnVector` hierarchy and related classes.
    
    1. make `ColumnVector` read-only
    2. introduce `MutableColumnVector` with write interface
    3. remove `ReadOnlyColumnVector`
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-21745

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18958.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18958
    
----
commit e4e22412c5ab23766a6908ec9e1a7931bcd52a54
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-08-15T04:09:16Z

    Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

commit cd0de397bba202cd5173e8aee0fc0bec2615295c
Author: Takuya UESHIN <ue...@databricks.com>
Date:   2017-08-15T04:38:32Z

    Modify VectorizedHashMapGenerator to use OnHeapColumnVector directly.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134399222
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,69 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      getColumnAsMutable(ordinal).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putBoolean(rowId, value);
         }
     
         @Override
         public void setByte(int ordinal, byte value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putByte(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    --- End diff --
    
    @ueshin @cloud-fan 
    Since `MutableColumnVector` in each column in `ColumnarBatch` is immutable, we can an array of `MutableColumnVector` by applying cast at initialization. If an cast exception occurs, we can ignore it since the column will not call setter APIs. Then, each setter in refers to an element of the array.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80918/testReport)** for PR 18958 at commit [`4d94655`](https://github.com/apache/spark/commit/4d94655b4695b16a2600e80824507e322af8ab00).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80979/testReport)** for PR 18958 at commit [`65cd681`](https://github.com/apache/spark/commit/65cd68184d4ae802d5fcac9a09fc780d3adf12b1).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    also cc @kiszk for another column vector pr.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80766/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134232538
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -970,30 +458,14 @@ public final int appendStruct(boolean isNull) {
       protected boolean anyNullsSet;
     
       /**
    -   * True if this column's values are fixed. This means the column values never change, even
    -   * across resets.
    -   */
    -  protected boolean isConstant;
    -
    -  /**
    -   * Default size of each array length value. This grows as necessary.
    -   */
    -  protected static final int DEFAULT_ARRAY_LENGTH = 4;
    -
    -  /**
    -   * Current write cursor (row index) when appending data.
    -   */
    -  protected int elementsAppended;
    -
    -  /**
        * If this is a nested type (array or struct), the column for the child data.
        */
       protected ColumnVector[] childColumns;
    --- End diff --
    
    yea, because mostly the child columns are of the same type of concrete column vector type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114861
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java ---
    @@ -0,0 +1,599 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.math.BigDecimal;
    +import java.math.BigInteger;
    +
    +import com.google.common.annotations.VisibleForTesting;
    +
    +import org.apache.spark.sql.internal.SQLConf;
    +import org.apache.spark.sql.types.*;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * This class adds write APIs to ColumnVector.
    + * It supports all the types and contains put APIs as well as their batched versions.
    + * The batched versions are preferable whenever possible.
    + *
    + * Capacity: The data stored is dense but the arrays are not fixed capacity. It is the
    + * responsibility of the caller to call reserve() to ensure there is enough room before adding
    + * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
    + * the lengths are known up front.
    + *
    + * A ColumnVector should be considered immutable once originally created. In other words, it is not
    --- End diff --
    
    How about `WritableColumnVector`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80918/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134393145
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    --- End diff --
    
    It seems like the rows returned by `ColumnarBatch.rowIterator` doesn't need to be mutable with our current tests, but `ColumnarBatch.Row` still needs to be mutable, the write apis of which are used in `HashAggregateExec`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80979/testReport)** for PR 18958 at commit [`65cd681`](https://github.com/apache/spark/commit/65cd68184d4ae802d5fcac9a09fc780d3adf12b1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    thanks, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #81038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81038/testReport)** for PR 18958 at commit [`8330870`](https://github.com/apache/spark/commit/8330870ef18b12dfeb51e5003c68aaff9dabb7a3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114491
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java ---
    @@ -433,7 +434,8 @@ private void readBinaryBatch(int rowId, int num, ColumnVector column) throws IOE
       }
     
       private void readFixedLenByteArrayBatch(int rowId, int num,
    -                                          ColumnVector column, int arrayLen) throws IOException {
    +                                          MutableColumnVector column,
    +                                          int arrayLen) throws IOException {
    --- End diff --
    
    nit:
    ```
    private void xxx(
        para1: XX,
        para2: XX)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133622586
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
    @@ -89,14 +91,23 @@ class VectorizedHashMapGenerator(
            |    $generatedAggBufferSchema
            |
            |  public $generatedClassName() {
    -       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    -       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    // TODO: Possibly generate this projection in HashAggregate directly
    -       |    aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    -       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
    -       |       aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
    +       |    batchVectors = new org.apache.spark.sql.execution.vectorized
    +       |      .OnHeapColumnVector[schema.fields().length];
    +       |    for (int i = 0; i < schema.fields().length; i++) {
    +       |      batchVectors[i] = new org.apache.spark.sql.execution.vectorized.OnHeapColumnVector(
    +       |        capacity, schema.fields()[i].dataType());
    +       |    }
    +       |    batch = new org.apache.spark.sql.execution.vectorized.ColumnarBatch(
    +       |      schema, batchVectors, capacity);
    +       |
    +       |    bufferVectors = new org.apache.spark.sql.execution.vectorized
    +       |      .OnHeapColumnVector[aggregateBufferSchema.fields().length];
    +       |    for (int i = 0; i < aggregateBufferSchema.fields().length; i++) {
    +       |      bufferVectors[i] = batchVectors[i + ${groupingKeys.length}];
            |    }
    +       |    // TODO: Possibly generate this projection in HashAggregate directly
    --- End diff --
    
    I'm sorry but I'm not sure because this is from original code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18958


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #81038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81038/testReport)** for PR 18958 at commit [`8330870`](https://github.com/apache/spark/commit/8330870ef18b12dfeb51e5003c68aaff9dabb7a3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133622350
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
    @@ -89,14 +91,23 @@ class VectorizedHashMapGenerator(
            |    $generatedAggBufferSchema
            |
            |  public $generatedClassName() {
    -       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    -       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    // TODO: Possibly generate this projection in HashAggregate directly
    -       |    aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    -       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
    -       |       aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
    +       |    batchVectors = new org.apache.spark.sql.execution.vectorized
    --- End diff --
    
    Sure, I'll try it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133626997
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -505,18 +500,12 @@ public void filterNullsInColumn(int ordinal) {
         nullFilteredColumns.add(ordinal);
       }
     
    -  private ColumnarBatch(StructType schema, int maxRows, MemoryMode memMode) {
    +  public ColumnarBatch(StructType schema, ColumnVector[] columns, int capacity) {
         this.schema = schema;
    -    this.capacity = maxRows;
    -    this.columns = new ColumnVector[schema.size()];
    +    this.columns = columns;
    +    this.capacity = capacity;
    --- End diff --
    
    I found some places referring `ColumnarBatch.capacity()`, so I'd be a little conservative to do that for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134780644
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -505,18 +511,12 @@ public void filterNullsInColumn(int ordinal) {
         nullFilteredColumns.add(ordinal);
       }
     
    -  private ColumnarBatch(StructType schema, int maxRows, MemoryMode memMode) {
    +  public ColumnarBatch(StructType schema, ColumnVector[] columns, int capacity) {
    --- End diff --
    
    we can add a new constructor that accepts `WritableColumnVector[]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80979/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80962/testReport)** for PR 18958 at commit [`9eb88a8`](https://github.com/apache/spark/commit/9eb88a843594c0b27bf4884ee3d7e52b243e8070).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114520
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
    @@ -172,20 +177,26 @@ public void initBatch(MemoryMode memMode, StructType partitionColumns,
           }
         }
     
    -    columnarBatch = ColumnarBatch.allocate(batchSchema, memMode);
    +    int capacity = ColumnarBatch.DEFAULT_BATCH_SIZE;
    +    if (memMode == MemoryMode.OFF_HEAP) {
    +      columnVectors = OffHeapColumnVector.allocateColumns(capacity, batchSchema);
    +    } else {
    +      columnVectors = OnHeapColumnVector.allocateColumns(capacity, batchSchema);
    --- End diff --
    
    seems we can have a `MutableColumnVector.allocateColumns(memMode)` for this logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134396135
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    --- End diff --
    
    oh, then we really need to think about how to eliminate the per-call type cast...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133618935
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java ---
    @@ -40,8 +39,43 @@
       private long lengthData;
       private long offsetData;
     
    -  protected OffHeapColumnVector(int capacity, DataType type) {
    -    super(capacity, type, MemoryMode.OFF_HEAP);
    +  public OffHeapColumnVector(int capacity, DataType type) {
    +    super(capacity, type);
    +
    +    if (type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType
    --- End diff --
    
    Sure, I'll try it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133418831
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,73 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    +      assert (columns[ordinal] instanceof MutableColumnVector);
           assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      ((MutableColumnVector) columns[ordinal]).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    +      assert (columns[ordinal] instanceof MutableColumnVector);
           assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      ((MutableColumnVector) columns[ordinal]).putNotNull(rowId);
    --- End diff --
    
    Maybe move the assertion and the cast in a private getter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by BryanCutler <gi...@git.apache.org>.
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133579534
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -505,18 +500,12 @@ public void filterNullsInColumn(int ordinal) {
         nullFilteredColumns.add(ordinal);
       }
     
    -  private ColumnarBatch(StructType schema, int maxRows, MemoryMode memMode) {
    +  public ColumnarBatch(StructType schema, ColumnVector[] columns, int capacity) {
         this.schema = schema;
    -    this.capacity = maxRows;
    -    this.columns = new ColumnVector[schema.size()];
    +    this.columns = columns;
    +    this.capacity = capacity;
    --- End diff --
    
    Does `capacity` really mean anything in here anymore since the `ColumnVectors` are allocated and populated outside now?  Could we just initialize `this.numRows = 0` and delay initializing of `this.filteredRows` until `setNumRows()` is called?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133617360
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,73 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    +      assert (columns[ordinal] instanceof MutableColumnVector);
           assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      ((MutableColumnVector) columns[ordinal]).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    +      assert (columns[ordinal] instanceof MutableColumnVector);
           assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      ((MutableColumnVector) columns[ordinal]).putNotNull(rowId);
    --- End diff --
    
    Sure, I'll add a private getter and update these.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80766 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80766/testReport)** for PR 18958 at commit [`b6ab633`](https://github.com/apache/spark/commit/b6ab63359e00d7fe0175204f191ff1baa10b789f).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81038/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80968/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134146811
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -970,30 +458,14 @@ public final int appendStruct(boolean isNull) {
       protected boolean anyNullsSet;
     
       /**
    -   * True if this column's values are fixed. This means the column values never change, even
    -   * across resets.
    -   */
    -  protected boolean isConstant;
    -
    -  /**
    -   * Default size of each array length value. This grows as necessary.
    -   */
    -  protected static final int DEFAULT_ARRAY_LENGTH = 4;
    -
    -  /**
    -   * Current write cursor (row index) when appending data.
    -   */
    -  protected int elementsAppended;
    -
    -  /**
        * If this is a nested type (array or struct), the column for the child data.
        */
       protected ColumnVector[] childColumns;
    --- End diff --
    
    can we move this to `WritableColumnVector`? I think `ColumnVector` only need `ColumnVector getChildColumn(int ordinal)`, and `WritableColumnVector` can overwrite it to `WritableColumnVector getChildColumn(int ordinal)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80775/testReport)** for PR 18958 at commit [`b6ab633`](https://github.com/apache/spark/commit/b6ab63359e00d7fe0175204f191ff1baa10b789f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133421832
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
    @@ -89,14 +91,23 @@ class VectorizedHashMapGenerator(
            |    $generatedAggBufferSchema
            |
            |  public $generatedClassName() {
    -       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    -       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    // TODO: Possibly generate this projection in HashAggregate directly
    -       |    aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    -       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
    -       |       aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
    +       |    batchVectors = new org.apache.spark.sql.execution.vectorized
    --- End diff --
    
    This happens quite a few times. It might be better to create a static util method that creates the vectors for you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133420801
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java ---
    @@ -491,6 +525,22 @@ public void loadBytes(ColumnVector.Array array) {
         array.byteArrayOffset = 0;
       }
     
    +  /**
    +   * Reserve a integer column for ids of dictionary.
    +   */
    +  @Override
    +  public OffHeapColumnVector reserveDictionaryIds(int capacity) {
    --- End diff --
    
    Same comment as in the constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80723/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134208734
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -0,0 +1,653 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.math.BigDecimal;
    +import java.math.BigInteger;
    +
    +import com.google.common.annotations.VisibleForTesting;
    +
    +import org.apache.spark.sql.internal.SQLConf;
    +import org.apache.spark.sql.types.*;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * This class adds write APIs to ColumnVector.
    + * It supports all the types and contains put APIs as well as their batched versions.
    + * The batched versions are preferable whenever possible.
    + *
    + * Capacity: The data stored is dense but the arrays are not fixed capacity. It is the
    + * responsibility of the caller to call reserve() to ensure there is enough room before adding
    + * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
    + * the lengths are known up front.
    + *
    + * A ColumnVector should be considered immutable once originally created. In other words, it is not
    + * valid to call put APIs after reads until reset() is called.
    + */
    +public abstract class WritableColumnVector extends ColumnVector {
    +
    +  /**
    +   * Resets this column for writing. The currently stored values are no longer accessible.
    +   */
    +  public void reset() {
    +    if (isConstant) return;
    +
    +    if (childColumns != null) {
    +      for (ColumnVector c: childColumns) {
    +        ((WritableColumnVector) c).reset();
    +      }
    +    }
    +    numNulls = 0;
    +    elementsAppended = 0;
    +    if (anyNullsSet) {
    +      putNotNulls(0, capacity);
    +      anyNullsSet = false;
    +    }
    +  }
    +
    +  public void reserve(int requiredCapacity) {
    +    if (requiredCapacity > capacity) {
    +      int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
    +      if (requiredCapacity <= newCapacity) {
    +        try {
    +          reserveInternal(newCapacity);
    +        } catch (OutOfMemoryError outOfMemoryError) {
    +          throwUnsupportedException(requiredCapacity, outOfMemoryError);
    +        }
    +      } else {
    +        throwUnsupportedException(requiredCapacity, null);
    +      }
    +    }
    +  }
    +
    +  private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
    +    String message = "Cannot reserve additional contiguous bytes in the vectorized reader " +
    +        "(requested = " + requiredCapacity + " bytes). As a workaround, you can disable the " +
    +        "vectorized reader by setting " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() +
    +        " to false.";
    +    throw new RuntimeException(message, cause);
    +  }
    +
    +  /**
    +   * Ensures that there is enough storage to store capacity elements. That is, the put() APIs
    +   * must work for all rowIds < capacity.
    +   */
    +  protected abstract void reserveInternal(int capacity);
    +
    +  /**
    +   * Sets the value at rowId to null/not null.
    +   */
    +  public abstract void putNotNull(int rowId);
    +  public abstract void putNull(int rowId);
    +
    +  /**
    +   * Sets the values from [rowId, rowId + count) to null/not null.
    +   */
    +  public abstract void putNulls(int rowId, int count);
    +  public abstract void putNotNulls(int rowId, int count);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putBoolean(int rowId, boolean value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putBooleans(int rowId, int count, boolean value);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putByte(int rowId, byte value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putBytes(int rowId, int count, byte value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putBytes(int rowId, int count, byte[] src, int srcIndex);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putShort(int rowId, short value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putShorts(int rowId, int count, short value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putShorts(int rowId, int count, short[] src, int srcIndex);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putInt(int rowId, int value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putInts(int rowId, int count, int value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putInts(int rowId, int count, int[] src, int srcIndex);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count])
    +   * The data in src must be 4-byte little endian ints.
    +   */
    +  public abstract void putIntsLittleEndian(int rowId, int count, byte[] src, int srcIndex);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putLong(int rowId, long value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putLongs(int rowId, int count, long value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putLongs(int rowId, int count, long[] src, int srcIndex);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count])
    +   * The data in src must be 8-byte little endian longs.
    +   */
    +  public abstract void putLongsLittleEndian(int rowId, int count, byte[] src, int srcIndex);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putFloat(int rowId, float value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putFloats(int rowId, int count, float value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putFloats(int rowId, int count, float[] src, int srcIndex);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count])
    +   * The data in src must be ieee formatted floats.
    +   */
    +  public abstract void putFloats(int rowId, int count, byte[] src, int srcIndex);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract void putDouble(int rowId, double value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to value.
    +   */
    +  public abstract void putDoubles(int rowId, int count, double value);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
    +   */
    +  public abstract void putDoubles(int rowId, int count, double[] src, int srcIndex);
    +
    +  /**
    +   * Sets values from [rowId, rowId + count) to [src[srcIndex], src[srcIndex + count])
    +   * The data in src must be ieee formatted doubles.
    +   */
    +  public abstract void putDoubles(int rowId, int count, byte[] src, int srcIndex);
    +
    +  /**
    +   * Puts a byte array that already exists in this column.
    +   */
    +  public abstract void putArray(int rowId, int offset, int length);
    +
    +  /**
    +   * Sets the value at rowId to `value`.
    +   */
    +  public abstract int putByteArray(int rowId, byte[] value, int offset, int count);
    +  public final int putByteArray(int rowId, byte[] value) {
    +    return putByteArray(rowId, value, 0, value.length);
    +  }
    +
    +  /**
    +   * Returns the value for rowId.
    +   */
    +  private ColumnVector.Array getByteArray(int rowId) {
    +    ColumnVector.Array array = getArray(rowId);
    +    array.data.loadBytes(array);
    +    return array;
    +  }
    +
    +  /**
    +   * Returns the decimal for rowId.
    +   */
    +  @Override
    +  public Decimal getDecimal(int rowId, int precision, int scale) {
    +    if (precision <= Decimal.MAX_INT_DIGITS()) {
    +      return Decimal.createUnsafe(getInt(rowId), precision, scale);
    +    } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    +      return Decimal.createUnsafe(getLong(rowId), precision, scale);
    +    } else {
    +      // TODO: best perf?
    +      byte[] bytes = getBinary(rowId);
    +      BigInteger bigInteger = new BigInteger(bytes);
    +      BigDecimal javaDecimal = new BigDecimal(bigInteger, scale);
    +      return Decimal.apply(javaDecimal, precision, scale);
    +    }
    +  }
    +
    +  public void putDecimal(int rowId, Decimal value, int precision) {
    +    if (precision <= Decimal.MAX_INT_DIGITS()) {
    +      putInt(rowId, (int) value.toUnscaledLong());
    +    } else if (precision <= Decimal.MAX_LONG_DIGITS()) {
    +      putLong(rowId, value.toUnscaledLong());
    +    } else {
    +      BigInteger bigInteger = value.toJavaBigDecimal().unscaledValue();
    +      putByteArray(rowId, bigInteger.toByteArray());
    +    }
    +  }
    +
    +  /**
    +   * Returns the UTF8String for rowId.
    +   */
    +  @Override
    +  public UTF8String getUTF8String(int rowId) {
    +    if (dictionary == null) {
    +      ColumnVector.Array a = getByteArray(rowId);
    +      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +    } else {
    +      byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
    +      return UTF8String.fromBytes(bytes);
    +    }
    +  }
    +
    +  /**
    +   * Returns the byte array for rowId.
    +   */
    +  @Override
    +  public byte[] getBinary(int rowId) {
    +    if (dictionary == null) {
    +      ColumnVector.Array array = getByteArray(rowId);
    +      byte[] bytes = new byte[array.length];
    +      System.arraycopy(array.byteArray, array.byteArrayOffset, bytes, 0, bytes.length);
    +      return bytes;
    +    } else {
    +      return dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
    +    }
    +  }
    +
    +  /**
    +   * Append APIs. These APIs all behave similarly and will append data to the current vector.  It
    +   * is not valid to mix the put and append APIs. The append APIs are slower and should only be
    +   * used if the sizes are not known up front.
    +   * In all these cases, the return value is the rowId for the first appended element.
    +   */
    +  public final int appendNull() {
    +    assert (!(dataType() instanceof StructType)); // Use appendStruct()
    +    reserve(elementsAppended + 1);
    +    putNull(elementsAppended);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendNotNull() {
    +    reserve(elementsAppended + 1);
    +    putNotNull(elementsAppended);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendNulls(int count) {
    +    assert (!(dataType() instanceof StructType));
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putNulls(elementsAppended, count);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendNotNulls(int count) {
    +    assert (!(dataType() instanceof StructType));
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putNotNulls(elementsAppended, count);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendBoolean(boolean v) {
    +    reserve(elementsAppended + 1);
    +    putBoolean(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendBooleans(int count, boolean v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putBooleans(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendByte(byte v) {
    +    reserve(elementsAppended + 1);
    +    putByte(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendBytes(int count, byte v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putBytes(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendBytes(int length, byte[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putBytes(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendShort(short v) {
    +    reserve(elementsAppended + 1);
    +    putShort(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendShorts(int count, short v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putShorts(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendShorts(int length, short[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putShorts(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendInt(int v) {
    +    reserve(elementsAppended + 1);
    +    putInt(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendInts(int count, int v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putInts(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendInts(int length, int[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putInts(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendLong(long v) {
    +    reserve(elementsAppended + 1);
    +    putLong(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendLongs(int count, long v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putLongs(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendLongs(int length, long[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putLongs(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendFloat(float v) {
    +    reserve(elementsAppended + 1);
    +    putFloat(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendFloats(int count, float v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putFloats(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendFloats(int length, float[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putFloats(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendDouble(double v) {
    +    reserve(elementsAppended + 1);
    +    putDouble(elementsAppended, v);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendDoubles(int count, double v) {
    +    reserve(elementsAppended + count);
    +    int result = elementsAppended;
    +    putDoubles(elementsAppended, count, v);
    +    elementsAppended += count;
    +    return result;
    +  }
    +
    +  public final int appendDoubles(int length, double[] src, int offset) {
    +    reserve(elementsAppended + length);
    +    int result = elementsAppended;
    +    putDoubles(elementsAppended, length, src, offset);
    +    elementsAppended += length;
    +    return result;
    +  }
    +
    +  public final int appendByteArray(byte[] value, int offset, int length) {
    +    int copiedOffset = arrayData().appendBytes(length, value, offset);
    +    reserve(elementsAppended + 1);
    +    putArray(elementsAppended, copiedOffset, length);
    +    return elementsAppended++;
    +  }
    +
    +  public final int appendArray(int length) {
    +    reserve(elementsAppended + 1);
    +    putArray(elementsAppended, arrayData().elementsAppended, length);
    +    return elementsAppended++;
    +  }
    +
    +  /**
    +   * Appends a NULL struct. This *has* to be used for structs instead of appendNull() as this
    +   * recursively appends a NULL to its children.
    +   * We don't have this logic as the general appendNull implementation to optimize the more
    +   * common non-struct case.
    +   */
    +  public final int appendStruct(boolean isNull) {
    +    if (isNull) {
    +      appendNull();
    +      for (ColumnVector c: childColumns) {
    +        if (c.type instanceof StructType) {
    +          ((WritableColumnVector) c).appendStruct(true);
    +        } else {
    +          ((WritableColumnVector) c).appendNull();
    +        }
    +      }
    +    } else {
    +      appendNotNull();
    +    }
    +    return elementsAppended;
    +  }
    +
    +  /**
    +   * Returns the data for the underlying array.
    +   */
    +  @Override
    +  public WritableColumnVector arrayData() { return (WritableColumnVector) childColumns[0]; }
    +
    +  /**
    +   * Returns the ordinal's child data column.
    +   */
    +  @Override
    +  public WritableColumnVector getChildColumn(int ordinal) {
    +    return (WritableColumnVector) childColumns[ordinal];
    +  }
    +
    +  /**
    +   * Returns the elements appended.
    +   */
    +  public final int getElementsAppended() { return elementsAppended; }
    +
    +  /**
    +   * Marks this column as being constant.
    +   */
    +  public final void setIsConstant() { isConstant = true; }
    +
    +  /**
    +   * Upper limit for the maximum capacity for this column.
    +   */
    +  @VisibleForTesting
    +  protected int MAX_CAPACITY = Integer.MAX_VALUE;
    +
    +  /**
    +   * Default size of each array length value. This grows as necessary.
    +   */
    +  protected static final int DEFAULT_ARRAY_LENGTH = 4;
    +
    +  /**
    +   * Current write cursor (row index) when appending data.
    +   */
    +  protected int elementsAppended;
    +
    +  /**
    +   * True if this column's values are fixed. This means the column values never change, even
    +   * across resets.
    +   */
    +  protected boolean isConstant;
    +
    +  /**
    +   * Update the dictionary.
    +   */
    +  public void setDictionary(Dictionary dictionary) {
    +    this.dictionary = dictionary;
    +  }
    +
    +  /**
    +   * Reserve a integer column for ids of dictionary.
    +   */
    +  public WritableColumnVector reserveDictionaryIds(int capacity) {
    +    WritableColumnVector dictionaryIds = (WritableColumnVector) this.dictionaryIds;
    +    if (dictionaryIds == null) {
    +      dictionaryIds = reserveNewColumn(capacity, DataTypes.IntegerType);
    +      this.dictionaryIds = dictionaryIds;
    +    } else {
    +      dictionaryIds.reset();
    +      dictionaryIds.reserve(capacity);
    +    }
    +    return dictionaryIds;
    +  }
    +
    +  /**
    +   * Returns the underlying integer column for ids of dictionary.
    +   */
    +  @Override
    +  public WritableColumnVector getDictionaryIds() {
    +    return (WritableColumnVector) dictionaryIds;
    +  }
    +
    +  /**
    +   * Reserve a new column.
    +   */
    +  protected abstract WritableColumnVector reserveNewColumn(int capacity, DataType type);
    +
    +  /**
    +   * Initialize child columns.
    +   */
    +  protected void initialize() {
    --- End diff --
    
    We could move this method into the constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80987/testReport)** for PR 18958 at commit [`65cd681`](https://github.com/apache/spark/commit/65cd68184d4ae802d5fcac9a09fc780d3adf12b1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    LGTM except some minor comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80723/testReport)** for PR 18958 at commit [`cd0de39`](https://github.com/apache/spark/commit/cd0de397bba202cd5173e8aee0fc0bec2615295c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133421918
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
    @@ -89,14 +91,23 @@ class VectorizedHashMapGenerator(
            |    $generatedAggBufferSchema
            |
            |  public $generatedClassName() {
    -       |    batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
    -       |      org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    // TODO: Possibly generate this projection in HashAggregate directly
    -       |    aggregateBufferBatch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(
    -       |      aggregateBufferSchema, org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
    -       |    for (int i = 0 ; i < aggregateBufferBatch.numCols(); i++) {
    -       |       aggregateBufferBatch.setColumn(i, batch.column(i+${groupingKeys.length}));
    +       |    batchVectors = new org.apache.spark.sql.execution.vectorized
    +       |      .OnHeapColumnVector[schema.fields().length];
    +       |    for (int i = 0; i < schema.fields().length; i++) {
    +       |      batchVectors[i] = new org.apache.spark.sql.execution.vectorized.OnHeapColumnVector(
    +       |        capacity, schema.fields()[i].dataType());
    +       |    }
    +       |    batch = new org.apache.spark.sql.execution.vectorized.ColumnarBatch(
    +       |      schema, batchVectors, capacity);
    +       |
    +       |    bufferVectors = new org.apache.spark.sql.execution.vectorized
    +       |      .OnHeapColumnVector[aggregateBufferSchema.fields().length];
    +       |    for (int i = 0; i < aggregateBufferSchema.fields().length; i++) {
    +       |      bufferVectors[i] = batchVectors[i + ${groupingKeys.length}];
            |    }
    +       |    // TODO: Possibly generate this projection in HashAggregate directly
    --- End diff --
    
    Can you elaborate?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133434818
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java ---
    @@ -0,0 +1,599 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.math.BigDecimal;
    +import java.math.BigInteger;
    +
    +import com.google.common.annotations.VisibleForTesting;
    +
    +import org.apache.spark.sql.internal.SQLConf;
    +import org.apache.spark.sql.types.*;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * This class adds write APIs to ColumnVector.
    + * It supports all the types and contains put APIs as well as their batched versions.
    + * The batched versions are preferable whenever possible.
    + *
    + * Capacity: The data stored is dense but the arrays are not fixed capacity. It is the
    + * responsibility of the caller to call reserve() to ensure there is enough room before adding
    + * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
    + * the lengths are known up front.
    + *
    + * A ColumnVector should be considered immutable once originally created. In other words, it is not
    --- End diff --
    
    This contradicts the name of this class. Maybe reuseable is a better way of describing what is going on here. Also cc @michal-databricks 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114822
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,69 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    --- End diff --
    
    just for curiosity, will we really update the rows returned by `ColumnarBatch.rowIterator`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80968/testReport)** for PR 18958 at commit [`9eb88a8`](https://github.com/apache/spark/commit/9eb88a843594c0b27bf4884ee3d7e52b243e8070).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80923/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134145753
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,69 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      getColumnAsMutable(ordinal).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putBoolean(rowId, value);
         }
     
         @Override
         public void setByte(int ordinal, byte value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putByte(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    --- End diff --
    
    In my understanding, cast still occurs at runtime. The cast operation may consist compare and branch.
    I am thinking about how we can reduce the cost of operations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114762
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -1049,43 +511,9 @@ public ColumnVector getDictionaryIds() {
        * Sets up the common state and also handles creating the child columns if this is a nested
        * type.
        */
    -  protected ColumnVector(int capacity, DataType type, MemoryMode memMode) {
    +  protected ColumnVector(int capacity, DataType type) {
         this.capacity = capacity;
         this.type = type;
    -
    -    if (type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType
    -        || DecimalType.isByteArrayDecimalType(type)) {
    -      DataType childType;
    -      int childCapacity = capacity;
    -      if (type instanceof ArrayType) {
    -        childType = ((ArrayType)type).elementType();
    -      } else {
    -        childType = DataTypes.ByteType;
    -        childCapacity *= DEFAULT_ARRAY_LENGTH;
    -      }
    -      this.childColumns = new ColumnVector[1];
    -      this.childColumns[0] = ColumnVector.allocate(childCapacity, childType, memMode);
    -      this.resultArray = new Array(this.childColumns[0]);
    -      this.resultStruct = null;
    -    } else if (type instanceof StructType) {
    -      StructType st = (StructType)type;
    -      this.childColumns = new ColumnVector[st.fields().length];
    -      for (int i = 0; i < childColumns.length; ++i) {
    -        this.childColumns[i] = ColumnVector.allocate(capacity, st.fields()[i].dataType(), memMode);
    -      }
    -      this.resultArray = null;
    -      this.resultStruct = new ColumnarBatch.Row(this.childColumns);
    -    } else if (type instanceof CalendarIntervalType) {
    -      // Two columns. Months as int. Microseconds as Long.
    -      this.childColumns = new ColumnVector[2];
    -      this.childColumns[0] = ColumnVector.allocate(capacity, DataTypes.IntegerType, memMode);
    -      this.childColumns[1] = ColumnVector.allocate(capacity, DataTypes.LongType, memMode);
    -      this.resultArray = null;
    -      this.resultStruct = new ColumnarBatch.Row(this.childColumns);
    -    } else {
    -      this.childColumns = null;
    -      this.resultArray = null;
    -      this.resultStruct = null;
    -    }
    +    this.isConstant = true;
    --- End diff --
    
    I think `isConstant` should belong to `MutableColumnVector`, because it's used to indicate that this column vector should not be updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80775/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80766 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80766/testReport)** for PR 18958 at commit [`b6ab633`](https://github.com/apache/spark/commit/b6ab63359e00d7fe0175204f191ff1baa10b789f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80968/testReport)** for PR 18958 at commit [`9eb88a8`](https://github.com/apache/spark/commit/9eb88a843594c0b27bf4884ee3d7e52b243e8070).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80723/testReport)** for PR 18958 at commit [`cd0de39`](https://github.com/apache/spark/commit/cd0de397bba202cd5173e8aee0fc0bec2615295c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134148287
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -970,30 +458,14 @@ public final int appendStruct(boolean isNull) {
       protected boolean anyNullsSet;
     
       /**
    -   * True if this column's values are fixed. This means the column values never change, even
    -   * across resets.
    -   */
    -  protected boolean isConstant;
    -
    -  /**
    -   * Default size of each array length value. This grows as necessary.
    -   */
    -  protected static final int DEFAULT_ARRAY_LENGTH = 4;
    -
    -  /**
    -   * Current write cursor (row index) when appending data.
    -   */
    -  protected int elementsAppended;
    -
    -  /**
        * If this is a nested type (array or struct), the column for the child data.
        */
       protected ColumnVector[] childColumns;
    --- End diff --
    
    We need this field for `ArrowColumnVector` to store its child columns, too.
    Do you want to make the method `getChildColumn(int ordinal)` abstract and move the field to more concrete classes to manage by themselves?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80923/testReport)** for PR 18958 at commit [`4d94655`](https://github.com/apache/spark/commit/4d94655b4695b16a2600e80824507e322af8ab00).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80962/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133419451
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java ---
    @@ -0,0 +1,599 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.math.BigDecimal;
    +import java.math.BigInteger;
    +
    +import com.google.common.annotations.VisibleForTesting;
    +
    +import org.apache.spark.sql.internal.SQLConf;
    +import org.apache.spark.sql.types.*;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * This class adds write APIs to ColumnVector.
    + * It supports all the types and contains put APIs as well as their batched versions.
    + * The batched versions are preferable whenever possible.
    + *
    + * Capacity: The data stored is dense but the arrays are not fixed capacity. It is the
    + * responsibility of the caller to call reserve() to ensure there is enough room before adding
    + * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
    + * the lengths are known up front.
    + *
    + * A ColumnVector should be considered immutable once originally created. In other words, it is not
    + * valid to call put APIs after reads until reset() is called.
    + */
    +public abstract class MutableColumnVector extends ColumnVector {
    +
    +  /**
    +   * Resets this column for writing. The currently stored values are no longer accessible.
    +   */
    +  @Override
    +  public void reset() {
    +    if (isConstant) return;
    +
    +    if (childColumns != null) {
    +      for (ColumnVector c: childColumns) {
    +        c.reset();
    +      }
    +    }
    +    numNulls = 0;
    +    elementsAppended = 0;
    +    if (anyNullsSet) {
    +      putNotNulls(0, capacity);
    +      anyNullsSet = false;
    +    }
    +  }
    +
    +  public void reserve(int requiredCapacity) {
    +    if (requiredCapacity > capacity) {
    +      int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
    +      if (requiredCapacity <= newCapacity) {
    +        try {
    +          reserveInternal(newCapacity);
    +        } catch (OutOfMemoryError outOfMemoryError) {
    +          throwUnsupportedException(requiredCapacity, outOfMemoryError);
    +        }
    +      } else {
    +        throwUnsupportedException(requiredCapacity, null);
    +      }
    +    }
    +  }
    +
    +  private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
    +    String message = "Cannot reserve additional contiguous bytes in the vectorized reader " +
    +        "(requested = " + requiredCapacity + " bytes). As a workaround, you can disable the " +
    +        "vectorized reader by setting " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() +
    +        " to false.";
    +
    +    if (cause != null) {
    +      throw new RuntimeException(message, cause);
    --- End diff --
    
    You are allowed to pass `null` as a cause to the `RuntimeException` constructor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114461
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
    @@ -450,14 +450,13 @@ class CodegenContext {
       /**
        * Returns the specialized code to set a given value in a column vector for a given `DataType`.
        */
    -  def setValue(batch: String, row: String, dataType: DataType, ordinal: Int,
    -      value: String): String = {
    +  def setValue(vector: String, row: String, dataType: DataType, value: String): String = {
    --- End diff --
    
    nit: `row` -> `rowId`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133618947
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java ---
    @@ -491,6 +525,22 @@ public void loadBytes(ColumnVector.Array array) {
         array.byteArrayOffset = 0;
       }
     
    +  /**
    +   * Reserve a integer column for ids of dictionary.
    +   */
    +  @Override
    +  public OffHeapColumnVector reserveDictionaryIds(int capacity) {
    --- End diff --
    
    Sure, I'll try it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    On a more generic level. We could also choose to make `ColumnVectors` immutable, and create builder classes to create (reusable) instances; this would create a better separation between the API's and make sure that the mutable vectors are not used incorrectly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134780277
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -95,19 +81,32 @@ public void close() {
         private final ColumnarBatch parent;
         private final int fixedLenRowSize;
         private final ColumnVector[] columns;
    +    private final WritableColumnVector[] writableColumns;
    --- End diff --
    
    can we move this to `ColumnarBatch` instead of `Row`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80775/testReport)** for PR 18958 at commit [`b6ab633`](https://github.com/apache/spark/commit/b6ab63359e00d7fe0175204f191ff1baa10b789f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80962/testReport)** for PR 18958 at commit [`9eb88a8`](https://github.com/apache/spark/commit/9eb88a843594c0b27bf4884ee3d7e52b243e8070).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134779475
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +306,69 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      getColumnAsWritable(ordinal).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putBoolean(rowId, value);
         }
     
         @Override
         public void setByte(int ordinal, byte value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putByte(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putByte(rowId, value);
         }
     
         @Override
         public void setShort(int ordinal, short value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putShort(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putShort(rowId, value);
         }
     
         @Override
         public void setInt(int ordinal, int value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putInt(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putInt(rowId, value);
         }
     
         @Override
         public void setLong(int ordinal, long value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putLong(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putLong(rowId, value);
         }
     
         @Override
         public void setFloat(int ordinal, float value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putFloat(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putFloat(rowId, value);
         }
     
         @Override
         public void setDouble(int ordinal, double value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putDouble(rowId, value);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putDouble(rowId, value);
         }
     
         @Override
         public void setDecimal(int ordinal, Decimal value, int precision) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putDecimal(rowId, value, precision);
    +      WritableColumnVector column = getColumnAsWritable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putDecimal(rowId, value, precision);
    +    }
    +
    +    private WritableColumnVector getColumnAsWritable(int ordinal) {
    --- End diff --
    
    nit: `getWritableColumn`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134114641
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
    @@ -281,62 +255,14 @@ public Object get(int ordinal, DataType dataType) {
       /**
        * Resets this column for writing. The currently stored values are no longer accessible.
        */
    -  public void reset() {
    -    if (isConstant) return;
    -
    -    if (childColumns != null) {
    -      for (ColumnVector c: childColumns) {
    -        c.reset();
    -      }
    -    }
    -    numNulls = 0;
    -    elementsAppended = 0;
    -    if (anyNullsSet) {
    -      putNotNulls(0, capacity);
    -      anyNullsSet = false;
    -    }
    -  }
    +  public abstract void reset();
    --- End diff --
    
    if `ColumnVector` is read-only, why we need a `reset` API?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80720/testReport)** for PR 18958 at commit [`cd0de39`](https://github.com/apache/spark/commit/cd0de397bba202cd5173e8aee0fc0bec2615295c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133617361
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/MutableColumnVector.java ---
    @@ -0,0 +1,599 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.vectorized;
    +
    +import java.math.BigDecimal;
    +import java.math.BigInteger;
    +
    +import com.google.common.annotations.VisibleForTesting;
    +
    +import org.apache.spark.sql.internal.SQLConf;
    +import org.apache.spark.sql.types.*;
    +import org.apache.spark.unsafe.types.UTF8String;
    +
    +/**
    + * This class adds write APIs to ColumnVector.
    + * It supports all the types and contains put APIs as well as their batched versions.
    + * The batched versions are preferable whenever possible.
    + *
    + * Capacity: The data stored is dense but the arrays are not fixed capacity. It is the
    + * responsibility of the caller to call reserve() to ensure there is enough room before adding
    + * elements. This means that the put() APIs do not check as in common cases (i.e. flat schemas),
    + * the lengths are known up front.
    + *
    + * A ColumnVector should be considered immutable once originally created. In other words, it is not
    + * valid to call put APIs after reads until reset() is called.
    + */
    +public abstract class MutableColumnVector extends ColumnVector {
    +
    +  /**
    +   * Resets this column for writing. The currently stored values are no longer accessible.
    +   */
    +  @Override
    +  public void reset() {
    +    if (isConstant) return;
    +
    +    if (childColumns != null) {
    +      for (ColumnVector c: childColumns) {
    +        c.reset();
    +      }
    +    }
    +    numNulls = 0;
    +    elementsAppended = 0;
    +    if (anyNullsSet) {
    +      putNotNulls(0, capacity);
    +      anyNullsSet = false;
    +    }
    +  }
    +
    +  public void reserve(int requiredCapacity) {
    +    if (requiredCapacity > capacity) {
    +      int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
    +      if (requiredCapacity <= newCapacity) {
    +        try {
    +          reserveInternal(newCapacity);
    +        } catch (OutOfMemoryError outOfMemoryError) {
    +          throwUnsupportedException(requiredCapacity, outOfMemoryError);
    +        }
    +      } else {
    +        throwUnsupportedException(requiredCapacity, null);
    +      }
    +    }
    +  }
    +
    +  private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
    +    String message = "Cannot reserve additional contiguous bytes in the vectorized reader " +
    +        "(requested = " + requiredCapacity + " bytes). As a workaround, you can disable the " +
    +        "vectorized reader by setting " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() +
    +        " to false.";
    +
    +    if (cause != null) {
    +      throw new RuntimeException(message, cause);
    --- End diff --
    
    Thanks. I'll update it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    cc @cloud-fan @BryanCutler 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80918/testReport)** for PR 18958 at commit [`4d94655`](https://github.com/apache/spark/commit/4d94655b4695b16a2600e80824507e322af8ab00).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134147099
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    --- End diff --
    
    one question, does the rows returned by `ColumnarBatch.rowIterator` have to be mutable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80987/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80987/testReport)** for PR 18958 at commit [`65cd681`](https://github.com/apache/spark/commit/65cd68184d4ae802d5fcac9a09fc780d3adf12b1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r134115047
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java ---
    @@ -307,64 +293,69 @@ public void update(int ordinal, Object value) {
     
         @Override
         public void setNullAt(int ordinal) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNull(rowId);
    +      getColumnAsMutable(ordinal).putNull(rowId);
         }
     
         @Override
         public void setBoolean(int ordinal, boolean value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putBoolean(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    +      column.putNotNull(rowId);
    +      column.putBoolean(rowId, value);
         }
     
         @Override
         public void setByte(int ordinal, byte value) {
    -      assert (!columns[ordinal].isConstant);
    -      columns[ordinal].putNotNull(rowId);
    -      columns[ordinal].putByte(rowId, value);
    +      MutableColumnVector column = getColumnAsMutable(ordinal);
    --- End diff --
    
    I'm a little afraid about this per-call type cast, but JVM should be able to optimize it perfectly, cc @kiszk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80720/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80923/testReport)** for PR 18958 at commit [`4d94655`](https://github.com/apache/spark/commit/4d94655b4695b16a2600e80824507e322af8ab00).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18958#discussion_r133420728
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java ---
    @@ -40,8 +39,43 @@
       private long lengthData;
       private long offsetData;
     
    -  protected OffHeapColumnVector(int capacity, DataType type) {
    -    super(capacity, type, MemoryMode.OFF_HEAP);
    +  public OffHeapColumnVector(int capacity, DataType type) {
    +    super(capacity, type);
    +
    +    if (type instanceof ArrayType || type instanceof BinaryType || type instanceof StringType
    --- End diff --
    
    Can you try to move this initialization logic into the parent class? We should be able to factor out the on/off-heap specific initialization logic into a separate method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    **[Test build #80720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80720/testReport)** for PR 18958 at commit [`cd0de39`](https://github.com/apache/spark/commit/cd0de397bba202cd5173e8aee0fc0bec2615295c).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/18958
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org