You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/11/24 16:10:33 UTC
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19815
[SPARK-22602][SQL] remove ColumnVector#loadBytes
## What changes were proposed in this pull request?
`ColumnVector#loadBytes` is only used as an optimization for reading UTF8String in `WritableColumnVector`, this PR moves this optimization to `WritableColumnVector` and simplified it.
## How was this patch tested?
existing test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark load-bytes
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19815.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19815
----
commit ae7db88d8e7e6705058121782df825a8a2eb965d
Author: Wenchen Fan <we...@databricks.com>
Date: 2017-11-24T16:07:55Z
remove ColumnVector#loadBytes
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84172/testReport)** for PR 19815 at commit [`3a59b32`](https://github.com/apache/spark/commit/3a59b32f2582ea5b734ebd94c5dd83c3cf1af625).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84170 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84170/testReport)** for PR 19815 at commit [`ae7db88`](https://github.com/apache/spark/commit/ae7db88d8e7e6705058121782df825a8a2eb965d).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153082230
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --
This is a bit of a non-issue: the current on-heap code path already avoids making copies.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19815
LGTM pending Jenkins.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153082248
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --
It might be a bit better to use `arrayData()` instead of `childColumns[0]`, they are practically the same, but it makes the intent a bit clearer.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84172/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153082266
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
} else {
byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
return UTF8String.fromBytes(bytes);
}
}
+ /**
+ * Returns a UTF8String whose data comes from [rowId, rowId + count] of this vector.
+ * This method is similar to {@link ColumnVector#getBytes(int, int)}, but can save data copy as
+ * UTF8String is used as a pointer.
+ */
+ protected abstract UTF8String getUTF8String0(int rowId, int count);
+
/**
* Returns the byte array for rowId.
*/
@Override
public byte[] getBinary(int rowId) {
if (dictionary == null) {
- ColumnarArray array = getByteArray(rowId);
- byte[] bytes = new byte[array.length];
- System.arraycopy(array.byteArray, array.byteArrayOffset, bytes, 0, bytes.length);
- return bytes;
+ return childColumns[0].getBytes(getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --
Same here, use `arrayData()`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84203/testReport)** for PR 19815 at commit [`5711bb2`](https://github.com/apache/spark/commit/5711bb20e9d598c8716d7d8636466ee82e623a20).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153060336
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
--- End diff --
Shall we add comment that `getUTF8String` reuse the data in column vector? It seems different than other getXXX APIs
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19815
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153060413
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
} else {
byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
return UTF8String.fromBytes(bytes);
--- End diff --
hmm, but looks `decodeToBinary` will copy byte data?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84203/testReport)** for PR 19815 at commit [`5711bb2`](https://github.com/apache/spark/commit/5711bb20e9d598c8716d7d8636466ee82e623a20).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153080542
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --
It looks risky if we do not make a copy.
If we plan to avoid the unnecessary data copy by this API, we should rename the API name `getUTF8String` and check all the callers whether they do not break the assumption.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19815
I will look this Sunday.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19815
cc @michal-databricks @kiszk @gatorsmile
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84172/testReport)** for PR 19815 at commit [`3a59b32`](https://github.com/apache/spark/commit/3a59b32f2582ea5b734ebd94c5dd83c3cf1af625).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84170/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153061379
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
--- End diff --
+1
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153080574
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
@@ -494,9 +495,8 @@ public void putArray(int rowId, int offset, int length) {
}
@Override
- public void loadBytes(ColumnarArray array) {
- array.byteArray = byteData;
- array.byteArrayOffset = array.offset;
+ protected UTF8String getUTF8String0(int rowId, int count) {
--- End diff --
Move it to
```
//
// APIs dealing with Bytes
//
```
Any better names?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84203/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/19815
Thanks! Merged to master.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:
https://github.com/apache/spark/pull/19815
LGTM except one comment
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19815
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19815
**[Test build #84170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84170/testReport)** for PR 19815 at commit [`ae7db88`](https://github.com/apache/spark/commit/ae7db88d8e7e6705058121782df825a8a2eb965d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153082361
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java ---
@@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) {
@Override
public UTF8String getUTF8String(int rowId) {
if (dictionary == null) {
- ColumnarArray a = getByteArray(rowId);
- return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length);
+ return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId));
} else {
byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId));
return UTF8String.fromBytes(bytes);
--- End diff --
That seems orthogonal to this issue. It would be nice if we could avoid the copy though. That would require some work on the dictionary code path.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19815#discussion_r153080187
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java ---
@@ -494,9 +495,8 @@ public void putArray(int rowId, int offset, int length) {
}
@Override
- public void loadBytes(ColumnarArray array) {
- array.byteArray = byteData;
- array.byteArrayOffset = array.offset;
+ protected UTF8String getUTF8String0(int rowId, int count) {
--- End diff --
Move it to
```
//
// APIs dealing with Bytes
//
```
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org