You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/11/24 16:10:33 UTC

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19815

    [SPARK-22602][SQL] remove ColumnVector#loadBytes

    ## What changes were proposed in this pull request?
    
    `ColumnVector#loadBytes` is only used as an optimization for reading UTF8String in `WritableColumnVector`, this PR moves this optimization to `WritableColumnVector` and simplified it.
    
    ## How was this patch tested?
    
    existing test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark load-bytes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19815.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19815
    
----
commit ae7db88d8e7e6705058121782df825a8a2eb965d
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-11-24T16:07:55Z

    remove ColumnVector#loadBytes

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84172/testReport)** for PR 19815 at commit [`3a59b32`](https://github.com/apache/spark/commit/3a59b32f2582ea5b734ebd94c5dd83c3cf1af625).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84170/testReport)** for PR 19815 at commit [`ae7db88`](https://github.com/apache/spark/commit/ae7db88d8e7e6705058121782df825a8a2eb965d).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153082230
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
    --- End diff --
    
    This is a bit of a non-issue: the current on-heap code path already avoids making copies.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    LGTM pending Jenkins.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153082248
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
    --- End diff --
    
    It might be a bit better to use `arrayData()` instead of `childColumns[0]`, they are practically the same, but it makes the intent a bit clearer.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84172/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153082266
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
         } else {
           byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
           return UTF8String.fromBytes(bytes);
         }
       }
     
    +  /**
    +   * Returns a UTF8String whose data comes from [rowId, rowId + count] of this vector.
    +   * This method is similar to {@link ColumnVector#getBytes(int, int)}, but can save data copy as
    +   * UTF8String is used as a pointer.
    +   */
    +  protected abstract UTF8String getUTF8String0(int rowId, int count);
    +
       /**
        * Returns the byte array for rowId.
        */
       @Override
       public byte[] getBinary(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray array = getByteArray(rowId);
    -      byte[] bytes = new byte[array.length];
    -      System.arraycopy(array.byteArray, array.byteArrayOffset, bytes, 0, bytes.length);
    -      return bytes;
    +      return childColumns[0].getBytes(getArrayOffset(rowId), getArrayLength(rowId));
    --- End diff --
    
    Same here, use `arrayData()`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84203/testReport)** for PR 19815 at commit [`5711bb2`](https://github.com/apache/spark/commit/5711bb20e9d598c8716d7d8636466ee82e623a20).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153060336
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
    --- End diff --
    
    Shall we add comment that `getUTF8String` reuse the data in column vector? It seems different than other getXXX APIs


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19815


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153060413
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
         } else {
           byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
           return UTF8String.fromBytes(bytes);
    --- End diff --
    
    hmm, but looks `decodeToBinary` will copy byte data?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84203/testReport)** for PR 19815 at commit [`5711bb2`](https://github.com/apache/spark/commit/5711bb20e9d598c8716d7d8636466ee82e623a20).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153080542
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
    --- End diff --
    
    It looks risky if we do not make a copy. 
    
    If we plan to avoid the unnecessary data copy by this API, we should rename the API name `getUTF8String` and check all the callers whether they do not break the assumption.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    I will look this Sunday.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    cc @michal-databricks @kiszk @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84172/testReport)** for PR 19815 at commit [`3a59b32`](https://github.com/apache/spark/commit/3a59b32f2582ea5b734ebd94c5dd83c3cf1af625).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84170/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153061379
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
    --- End diff --
    
    +1


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153080574
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -494,9 +495,8 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    -  public void loadBytes(ColumnarArray array) {
    -    array.byteArray = byteData;
    -    array.byteArrayOffset = array.offset;
    +  protected UTF8String getUTF8String0(int rowId, int count) {
    --- End diff --
    
    Move it to 
    ```
      //
      // APIs dealing with Bytes
      //
    ```
    
    Any better names?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84203/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Thanks! Merged to master.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    LGTM except one comment


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19815
  
    **[Test build #84170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84170/testReport)** for PR 19815 at commit [`ae7db88`](https://github.com/apache/spark/commit/ae7db88d8e7e6705058121782df825a8a2eb965d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153082361
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
    @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
       @Override
       public UTF8String getUTF8String(int rowId) {
         if (dictionary == null) {
    -      ColumnarArray a = getByteArray(rowId);
    -      return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
    +      return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
         } else {
           byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
           return UTF8String.fromBytes(bytes);
    --- End diff --
    
    That seems orthogonal to this issue. It would be nice if we could avoid the copy though. That would require some work on the dictionary code path.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19815#discussion_r153080187
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
    @@ -494,9 +495,8 @@ public void putArray(int rowId, int offset, int length) {
       }
     
       @Override
    -  public void loadBytes(ColumnarArray array) {
    -    array.byteArray = byteData;
    -    array.byteArrayOffset = array.offset;
    +  protected UTF8String getUTF8String0(int rowId, int count) {
    --- End diff --
    
    Move it to 
    ```
      //
      // APIs dealing with Bytes
      //
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org