Posted to reviews@spark.apache.org by kiszk <gi...@git.apache.org> on 2017/05/17 08:58:00 UTC

[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

GitHub user kiszk opened a pull request:

    https://github.com/apache/spark/pull/18014

    [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeArrayData for array

    ## What changes were proposed in this pull request?
    
    This PR enhances `ColumnVector` to keep `UnsafeArrayData` for arrays, so that `ColumnVector` can be used for the table cache (e.g. CACHE TABLE, DataFrame.cache).
    
    The current `ColumnVector` accepts only a primitive-type Java array as input for an array, which is good for keeping data from Parquet.
    
    This PR changed or added the following APIs:
    
    `ColumnVector ColumnVector.allocate(int capacity, DataType type, MemoryMode mode, boolean useUnsafeArrayData)`
    * When the last argument is `true`, the `ColumnVector` can keep `UnsafeArrayData`; when it is `false`, it cannot.
    
    `int ColumnVector.putArray(int rowId, ArrayData array)`
    * When this `ColumnVector` was generated with `useUnsafeArrayData=true`, this method stores `UnsafeArrayData` into the `ColumnVector`. Otherwise, it throws an exception.
    
    `ArrayData ColumnVector.getArray(int rowId)`
    * When this `ColumnVector` was generated with `useUnsafeArrayData=true`, this method returns `UnsafeArrayData`.
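    The proposed contract can be sketched with a toy stand-in class (an illustration only: the class below and its use of `int[]` in place of `ArrayData` are assumptions for a self-contained example, not Spark's actual `ColumnVector`):
    
    ```java
    // Toy stand-in (not Spark's ColumnVector) illustrating the proposed contract:
    // putArray(rowId, array) is only legal when useUnsafeArrayData was true.
    import java.util.HashMap;
    import java.util.Map;
    
    class ToyColumnVector {
        private final boolean useUnsafeArrayData;
        private final Map<Integer, int[]> arrays = new HashMap<>();
    
        // Mirrors `ColumnVector.allocate(capacity, type, mode, useUnsafeArrayData)`.
        public static ToyColumnVector allocate(int capacity, boolean useUnsafeArrayData) {
            return new ToyColumnVector(useUnsafeArrayData);
        }
    
        private ToyColumnVector(boolean useUnsafeArrayData) {
            this.useUnsafeArrayData = useUnsafeArrayData;
        }
    
        // Mirrors `int putArray(int rowId, ArrayData array)`: stores the array,
        // or throws when the vector was not allocated for UnsafeArrayData.
        public int putArray(int rowId, int[] array) {
            if (!useUnsafeArrayData) {
                throw new UnsupportedOperationException("allocated with useUnsafeArrayData=false");
            }
            arrays.put(rowId, array);
            return array.length;
        }
    
        // Mirrors `ArrayData getArray(int rowId)`.
        public int[] getArray(int rowId) {
            return arrays.get(rowId);
        }
    }
    ```
    
    With this contract, a vector allocated with `true` stores and returns the array, while one allocated with `false` rejects `putArray`.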
    
    ## How was this patch tested?
    
    Updated the existing test suites


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark OnHeapColumnVector

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18014.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18014
    
----
commit b939a0819ac9feb4ff779f50a19ceb101cada999
Author: Kazuaki Ishizaki <is...@jp.ibm.com>
Date:   2017-04-24T16:56:48Z

    Keep UnsafeArrayData for Array in ColumnVector

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77024/
    Test PASSed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Jenkins, test this please




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @cloud-fan Could you please let us know your thoughts?
    Is it better to use the binary type, or to add simple logic for `UnsafeArrayData` and the other types in `ColumnVector`?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Yes. Let me implement a new `putArray(int rowId, Array array)` that uses `ColumnVector.Array` and stores primitive-type elements into a primitive array (e.g. `intData`).




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77023/testReport)** for PR 18014 at commit [`b939a08`](https://github.com/apache/spark/commit/b939a0819ac9feb4ff779f50a19ceb101cada999).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77024 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77024/testReport)** for PR 18014 at commit [`3f726ba`](https://github.com/apache/spark/commit/3f726ba8ff6d37c5b9b25590b5d76a1376352aa3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77024 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77024/testReport)** for PR 18014 at commit [`3f726ba`](https://github.com/apache/spark/commit/3f726ba8ff6d37c5b9b25590b5d76a1376352aa3).




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r120933598
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    ping @cloud-fan 




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk closed the pull request at:

    https://github.com/apache/spark/pull/18014




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r120311436
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    @cloud-fan I agree that `ColumnVector.Array.getArray` returns the same instance as `ColumnVector.resultArray`. This is good for a one-dimensional array.
    In my understanding, we have to get a `ColumnVector` from this instance to support nested arrays. However, the current `ColumnVector.Array` does not have a getter for `ColumnVector`, so it is hard to implement nested arrays with the current `ColumnVector` implementation. Am I correct? Or do you have an idea for enhancing `ColumnVector`?
    
    I would appreciate it if we could have a F2F chat at the Moscone Center on Tuesday.




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r119186654
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    I don't think so; `ColumnVector.Array.getArray` actually returns the same instance.




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118660658
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    Regarding `UnsafeArrayData`, I understood what you are talking about. As you said, the leaf elements are not stored together, as shown below.
    ```
    dimension |[0][0 ... n-1]       |[1][0 ... n-1]       |[2][0 ... n-1]       |
    memory    |[null bits][elements]|[null bits][elements]|[null bits][elements]|
    ```
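    The separation can be sketched with simplified offset arithmetic (an illustration under assumed sizes only, not Spark code: 4-byte int elements and an 8-byte null-bits region per inner array):
    
    ```java
    // Why leaf elements of a nested array are not contiguous in this layout:
    // each inner array carries its own [null bits][elements] segment.
    class NestedLayout {
        // Byte offset of element j of inner array i, for n-element inner
        // arrays of 4-byte ints with an 8-byte null-bits header per array.
        public static int elementOffset(int i, int j, int n) {
            int headerBytes = 8;                  // simplified null-bits region
            int innerBytes = headerBytes + 4 * n; // one [null bits][elements] segment
            return i * innerBytes + headerBytes + 4 * j;
        }
    }
    ```
    
    For `n = 4`, inner array 0's elements end at byte 24 while inner array 1's elements start at byte 32, so leaf elements of adjacent rows are separated by the next null-bits header.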
    
    On the other hand, for `ColumnVector`, IIUC, `ColumnarBatchSuite.String APIs` uses a [single-dimension array](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala#L646-L650).
    For a nested array, the following access is required. In this case, `array0` and `array1` are different instances, so the leaf elements are not stored together.
    ```
    val arrayRow0 : ColumnVector.Array = columnVector.getArray(0)
    val array0 : ColumnVector.Array = arrayRow0.getArray(0)
    val element00 : int = array0.getInt(0)
    val array1 : ColumnVector.Array = arrayRow0.getArray(1)
    val element01 : int = array1.getInt(1)
    ```
    
    What do you think?





[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118622572
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    You are right; the current implementation will require multiple bulk copies for a nested array.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77023/
    Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    I think there is a gap between the columnar format and the unsafe row format. The current `ColumnVector` format looks reasonable for the array type, as it puts the leaf elements together, which is better for compression.
    
    While changing it to binary makes it faster when we need to read an array from `ColumnVector` and convert it to `ArrayData`, it is hardly an efficient way to store arrays in columnar format. Shall we consider updating the `UnsafeArrayData` format?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77385/testReport)** for PR 18014 at commit [`9954d6b`](https://github.com/apache/spark/commit/9954d6b20d3ed5743100d342b8b8102f60f8f9f2).




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77017/testReport)** for PR 18014 at commit [`b939a08`](https://github.com/apache/spark/commit/b939a0819ac9feb4ff779f50a19ceb101cada999).




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @cloud-fan Here is an implementation based on Option 2, covering only simple data types (e.g. boolean, int, double, and so on). It uses a bulk copy for the array body in `putArray()`, and an [element-wise copy for the null bits if `containsNull()` is true](https://github.com/apache/spark/pull/18014/files#diff-7738d15a778268fd8b5574e2c655a660R683) in `putArray()`.
    Do you have any comments or feedback?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @cloud-fan Thank you for your comments. Let me confirm your ideas.
    1. Do you want to keep the array contents of `UnsafeArrayData` in [a primitive data array (e.g. intData[])](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L43)?
    2. How do you want to update `UnsafeArrayData`?
    
    We can map the current `UnsafeArrayData` into `ColumnVector`. The following is the format of `UnsafeArrayData`.
    ```
     [numElements][null bits][values or offset&length][variable length portion]
    ```
    * [numElements]: store it into `arrayLengths[]`
    * [null bits]: ***Needs conversion from the bit-vector representation to the byte representation***
    * [values]: store as each data type
    * [offset&length][variable length portion]: store as `ByteType`
    
    The issue is the conversion of the `null bits`. However, if we use a byte representation in `UnsafeArrayData`, it may waste more memory space. To avoid this, we could update `ColumnVector` to support a bit-vector representation for the nullability of each element.
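    As a rough illustration of that layout, the hypothetical helper below computes the header size before the value region, assuming the word-aligned format of recent `UnsafeArrayData` (8 bytes for `numElements`, plus the null-bit region rounded up to whole 8-byte words); this rounding is exactly why a byte-per-element null representation in `ColumnVector` needs a conversion step:

    ```java
    public class UnsafeArrayLayout {
        // [numElements][null bits][values or offset&length][variable length portion]
        public static long headerSizeInBytes(int numElements) {
            long nullBitsWords = (numElements + 63) / 64;  // one bit per element, 8-byte aligned
            return 8L /* numElements field */ + nullBitsWords * 8L;
        }

        public static void main(String[] args) {
            System.out.println(headerSizeInBytes(2));    // 16: 8 + one null-bit word
            System.out.println(headerSizeInBytes(100));  // 24: 8 + two null-bit words
        }
    }
    ```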
    
    
    On the other hand, my current approach stores the whole `UnsafeArrayData` as binary into [`byte[] data`](https://github.com/apache/spark/pull/18014/files#diff-f1e0f2d99a6cdc0113487f8358861fb3R56). The advantage of this approach is that there is no conversion cost at put/get.
    
    What do you think?





[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @cloud-fan What do you think?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77023/testReport)** for PR 18014 at commit [`b939a08`](https://github.com/apache/spark/commit/b939a0819ac9feb4ff779f50a19ceb101cada999).




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    I took a look at `ColumnVector.getArray`; it seems it already has no cost? Writing still needs some copying, though.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77385/
    Test PASSed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    I may be missing something, but can we just treat the array type as a binary type and put it in `ColumnVector`?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    I think option 2 is better. `ColumnVector.getArray()` should be as fast as possible. The caller side may just get an element from this array, and then the final projection doesn't need to copy the array.
    
    But if we do need to copy the array in the final projection, we should also speed it up with a bulk copy.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77384/testReport)** for PR 18014 at commit [`bf6ab20`](https://github.com/apache/spark/commit/bf6ab2060550671c59e013a3c56accdd698df9ee).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77017/
    Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @cloud-fan When I think about the use case of `ColumnVector.getArray` (i.e. in code generated by whole-stage code generation), I think it is better to return `UnsafeArrayData` instead of `ColumnVector.Array` in the current generated code.
    
    The following is an example program and the current generated code. In the generated code, we will replace `inputadapter_row.getArray(0)` at line 35 with `columnVector.getArray(rowIdx)`. So, let us assume `columnVector.getArray(rowIdx)` is used and focus on `inputadapter_value`.
    
    At projection time, if the type of `inputadapter_value` is `UnsafeArrayData`, line 72 just performs a fast memory copy of a contiguous region. On the other hand, if the type of `inputadapter_value` is `ColumnVector.Array`, lines 79-86 perform a slower element-wise copy.
    
    I think there are three options:
    1. Add a method `UnsafeArrayData ColumnVector.getArray()` so that `UnsafeArrayData` can be used everywhere in the generated code.
    2. Add a conditional branch for `ColumnVector.Array` in the generated code and prepare a specialized copy routine for `ColumnVector.Array`. In this case, we can use a bulk copy for the array body, but an element-wise copy for the null bits.
    3. In addition to 2, support a bit-wise null-bits representation in `ColumnVector.Array`. In this case, we can use two bulk copies: one for the null bits, and the other for the array body. The downside of this approach is that it introduces an additional conditional branch in `ColumnVector.isNullAt()` (i.e. checking whether the byte-wise or bit-wise null representation is used).
    
    What do you think? Or are there any other ideas?
    
    
    ```scala
    val df = sparkContext.parallelize(Seq(Array(0, 1), Array(1, 2)), 1).toDF("a").cache
    df.count
    df.filter("a[0] > 0").show
    ```    
    
    
    ```java
    /* 031 */   protected void processNext() throws java.io.IOException {
    /* 032 */     while (inputadapter_input.hasNext() && !stopEarly()) {
    /* 033 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
    /* 034 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
    /* 035 */       ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
    /* 037 */       if (!(!(inputadapter_isNull))) continue;
    /* 039 */       boolean filter_isNull2 = true;
    /* 040 */       boolean filter_value2 = false;
    /* 042 */       boolean filter_isNull3 = true;
    /* 043 */       int filter_value3 = -1;
    /* 045 */       filter_isNull3 = false;
    /* 047 */       final int filter_index = (int) 0;
    /* 048 */       if (filter_index >= inputadapter_value.numElements() || filter_index < 0 || inputadapter_value.isNullAt(filter_index)) {
    /* 049 */         filter_isNull3 = true;
    /* 050 */       } else {
    /* 051 */         filter_value3 = inputadapter_value.getInt(filter_index);
    /* 052 */       }
    /* 053 */       if (!filter_isNull3) {
    /* 054 */         filter_isNull2 = false;
    /* 055 */         filter_value2 = filter_value3 > 0;
    /* 057 */       }
    /* 058 */       if (filter_isNull2 || !filter_value2) continue;
    /* 060 */       filter_numOutputRows.add(1);
    /* 062 */       filter_holder.reset();
    /* 066 */       final int filter_tmpCursor = filter_holder.cursor;
    /* 068 */       if (inputadapter_value instanceof UnsafeArrayData) {
    /* 069 */         final int filter_sizeInBytes = ((UnsafeArrayData) inputadapter_value).getSizeInBytes();
    /* 071 */         filter_holder.grow(filter_sizeInBytes);
    /* 072 */         ((UnsafeArrayData) inputadapter_value).writeToMemory(filter_holder.buffer, filter_holder.cursor);
    /* 073 */         filter_holder.cursor += filter_sizeInBytes;
    /* 075 */       } else {
    /* 076 */         final int filter_numElements = inputadapter_value.numElements();
    /* 077 */         filter_arrayWriter.initialize(filter_holder, filter_numElements, 4);
    /* 079 */         for (int filter_index1 = 0; filter_index1 < filter_numElements; filter_index1++) {
    /* 080 */           if (inputadapter_value.isNullAt(filter_index1)) {
    /* 081 */             filter_arrayWriter.setNullInt(filter_index1);
    /* 082 */           } else {
    /* 083 */             final int filter_element = inputadapter_value.getInt(filter_index1);
    /* 084 */             filter_arrayWriter.write(filter_index1, filter_element);
    /* 085 */           }
    /* 086 */         }
    /* 087 */       }
    /* 089 */       filter_rowWriter.setOffsetAndSize(0, filter_tmpCursor, filter_holder.cursor - filter_tmpCursor);
    /* 090 */       filter_result.setTotalSize(filter_holder.totalSize());
    /* 091 */       append(filter_result);
    /* 092 */       if (shouldStop()) return;
    /* 093 */     }
    /* 094 */   }
    ```




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    @hvanhovell @sameeragarwal would it be possible to look at this?
    cc: @cloud-fan 




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118621670
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    if we want to support nested arrays, do we need to do multiple bulk copies?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Jenkins, test this please




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r119182758
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    ping @cloud-fan 




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77385/testReport)** for PR 18014 at commit [`9954d6b`](https://github.com/apache/spark/commit/9954d6b20d3ed5743100d342b8b8102f60f8f9f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118624684
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    No, I think it's the opposite. `UnsafeArrayData` blindly puts elements together; if the element is also an array type, each element contains its own null bits, so the leaf elements are not contiguous. For `ColumnVector`, you can take a look at `ColumnarBatchSuite`'s `String APIs` test: the leaf elements are stored together.
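    A minimal sketch of that "leaf elements together" layout (hypothetical names, not Spark's actual classes): inner arrays are flattened into one contiguous leaf buffer plus per-array offset/length, so each leaf run can be written with a bulk copy instead of element-wise writes:

    ```java
    public class NestedLeafLayout {
        public static int[] leaf;      // all leaf ints, stored contiguously
        public static int[] offsets;   // start of each inner array within `leaf`
        public static int[] lengths;   // length of each inner array

        public static void store(int[][] nested) {
            int total = 0;
            for (int[] a : nested) total += a.length;
            leaf = new int[total];
            offsets = new int[nested.length];
            lengths = new int[nested.length];
            int pos = 0;
            for (int i = 0; i < nested.length; i++) {
                offsets[i] = pos;
                lengths[i] = nested[i].length;
                // One bulk copy per inner array; the leaf region stays contiguous.
                System.arraycopy(nested[i], 0, leaf, pos, nested[i].length);
                pos += nested[i].length;
            }
        }
    }
    ```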




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r119808220
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    Let me think about this over the weekend, since I am busy preparing slides for Spark Summit.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    This is an older one; let me close it. I will then submit another PR very soon that does the same thing in a different way.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77017/testReport)** for PR 18014 at commit [`b939a08`](https://github.com/apache/spark/commit/b939a0819ac9feb4ff779f50a19ceb101cada999).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118621775
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    I think we may need to update `UnsafeArrayData` to put the leaf elements together, like `ColumnVector` does; then we can use just one bulk copy to write a nested array.
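
A rough illustration of the suggestion above (this is an editorial sketch, not Spark code: `System.arraycopy` stands in for `Platform.copyMemory`, and all names are hypothetical). If the leaf elements of a nested array were laid out contiguously, writing them into a column could be a single bulk copy instead of one copy per inner array:

```java
import java.util.Arrays;

// Sketch only: contrasts one bulk copy over a contiguous leaf-element
// region with the per-row copies needed when each inner array is a
// separate allocation.
public class BulkCopySketch {
    // Single bulk copy: all leaf elements live in one contiguous array.
    static int[] copyContiguous(int[] leafElements) {
        int[] dest = new int[leafElements.length];
        System.arraycopy(leafElements, 0, dest, 0, leafElements.length); // one copy
        return dest;
    }

    // Per-row copies: each inner array needs its own copy call.
    static int[] copyPerRow(int[][] rows) {
        int total = 0;
        for (int[] row : rows) total += row.length;
        int[] dest = new int[total];
        int offset = 0;
        for (int[] row : rows) {
            System.arraycopy(row, 0, dest, offset, row.length); // one copy per row
            offset += row.length;
        }
        return dest;
    }

    public static void main(String[] args) {
        int[] contiguous = {1, 2, 3, 4};       // leaf elements stored contiguously
        int[][] perRow = {{1, 2}, {3, 4}};     // same data, one allocation per row
        System.out.println(Arrays.equals(copyContiguous(contiguous), copyPerRow(perRow)));
    }
}
```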




[GitHub] spark pull request #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep U...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18014#discussion_r118622945
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -386,6 +425,35 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    +  public void putArray(int rowId, Object src, int srcOffset, int dstOffset, int numElements) {
    +    DataType et = type;
    +    reserve(dstOffset + numElements);
    +    if (et == DataTypes.BooleanType || et == DataTypes.ByteType) {
    +      Platform.copyMemory(
    +        src, srcOffset, byteData, Platform.BYTE_ARRAY_OFFSET + dstOffset, numElements);
    +    } else if (et == DataTypes.ShortType) {
    +      Platform.copyMemory(
    +        src, srcOffset, shortData, Platform.SHORT_ARRAY_OFFSET + dstOffset * 2, numElements * 2);
    +    } else if (et == DataTypes.IntegerType || et == DataTypes.DateType ||
    +      DecimalType.is32BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, intData, Platform.INT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (type instanceof LongType || type instanceof TimestampType ||
    +      DecimalType.is64BitDecimalType(type)) {
    +      Platform.copyMemory(
    +        src, srcOffset, longData, Platform.LONG_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else if (et == DataTypes.FloatType) {
    +      Platform.copyMemory(
    +        src, srcOffset, floatData, Platform.FLOAT_ARRAY_OFFSET + dstOffset * 4, numElements * 4);
    +    } else if (et == DataTypes.DoubleType) {
    +      Platform.copyMemory(
    +        src, srcOffset, doubleData, Platform.DOUBLE_ARRAY_OFFSET + dstOffset * 8, numElements * 8);
    +    } else {
    +      throw new RuntimeException("Unhandled " + type);
    --- End diff --
    
    IIUC, `UnsafeArrayData` puts all of the elements together in a contiguous memory region even when it is a nested array.
    On the other hand, the current `ColumnVector` stores the elements of each dimension in a separate child column (`ColumnVector.childColumns[0]`), so multiple bulk copies are required.
    For example, given the array `Array(Array(1), Array(2))`, `Array(1)` and `Array(2)` are stored into different positions of `intData` in `ColumnVector.childColumns[0]`.
    Am I correct?
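
To make the layout described above concrete, here is an editorial sketch (field names like `intData` and the offset/length arrays are illustrative stand-ins, not Spark's exact fields) of how `Array(Array(1), Array(2))` could be represented when each dimension lives in its own child column:

```java
// Sketch only: reconstruct Array(Array(1), Array(2)) from a per-dimension
// layout where the parent holds offsets/lengths into a child column.
public class NestedLayoutSketch {
    public static void main(String[] args) {
        // Outer array column: row 0 is an array of 2 inner arrays.
        int[] outerOffsets = {0};
        int[] outerLengths = {2};
        // Inner array column: offsets/lengths into the leaf data.
        int[] innerOffsets = {0, 1};  // Array(1) starts at 0, Array(2) at 1
        int[] innerLengths = {1, 1};
        // Leaf column holding the int elements of both inner arrays.
        int[] intData = {1, 2};

        // Reading back row 0 walks two levels of indirection.
        StringBuilder sb = new StringBuilder();
        for (int i = outerOffsets[0]; i < outerOffsets[0] + outerLengths[0]; i++) {
            for (int j = innerOffsets[i]; j < innerOffsets[i] + innerLengths[i]; j++) {
                sb.append(intData[j]).append(' ');
            }
        }
        System.out.println(sb.toString().trim());
    }
}
```

Because the inner arrays occupy distinct ranges of the child's `intData`, filling them from a source whose rows are separate allocations requires one copy per inner array rather than one bulk copy.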




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77384/
    Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    **[Test build #77384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77384/testReport)** for PR 18014 at commit [`bf6ab20`](https://github.com/apache/spark/commit/bf6ab2060550671c59e013a3c56accdd698df9ee).




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    I thought that idea was for Apache Arrow.
    We could use the binary type for `UnsafeArrayData`. However, it adds some complexity to the use of [`ColumnVector.Array`](https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java#L1015-L1017).
    
    Would it be better to use the existing code?




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18014
  
    What is the latest status of this PR?


---
