You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/30 03:54:55 UTC
[GitHub] [spark] LuciferYang opened a new pull request, #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
LuciferYang opened a new pull request, #37029:
URL: https://github.com/apache/spark/pull/37029
### What changes were proposed in this pull request?
Similar of SPARK-39231, this pr change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`.
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such as the documentation fix.
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
If no, write 'No'.
-->
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170955123
after this pr
```
================================================================================================
Vectorized Scan Multiple Partition Columns
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 1 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 336 357 23 46.7 21.4 1.0X
Parquet Vectorized: DataPageV2 329 345 17 47.8 20.9 1.0X
ORC Vectorized 419 453 49 37.5 26.6 0.8X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 2 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 411 418 6 38.2 26.2 1.0X
Parquet Vectorized: DataPageV2 408 425 28 38.6 25.9 1.0X
ORC Vectorized 427 432 5 36.8 27.2 1.0X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 3 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 432 442 14 36.4 27.5 1.0X
Parquet Vectorized: DataPageV2 430 435 7 36.6 27.3 1.0X
ORC Vectorized 463 466 4 34.0 29.4 0.9X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911655692
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
for (int i = 0; i < requiredFields.length; i++) {
DataType dt = requiredFields[i].dataType();
if (requestedPartitionColIds[i] != -1) {
- OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+ ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);
Review Comment:
<img width="845" alt="image" src="https://user-images.githubusercontent.com/1475305/176837640-94f67389-9476-4ccb-a597-154d328f61ab.png">
`ConstantColumnVectorBenchmark` and `ColumnVectorSuite` still use this method, can move it to `ConstantColumnVectorBenchmark` and remove the case in `ColumnVectorSuite`. Do all in this pr?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170954614
before
```
================================================================================================
Vectorized Scan Multiple Partition Columns
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 1 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 411 423 19 38.3 26.1 1.0X
Parquet Vectorized: DataPageV2 412 416 5 38.2 26.2 1.0X
ORC Vectorized 523 527 3 30.1 33.2 0.8X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 2 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 417 422 4 37.7 26.5 1.0X
Parquet Vectorized: DataPageV2 418 426 8 37.7 26.5 1.0X
ORC Vectorized 483 487 6 32.6 30.7 0.9X
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Vectorized Scan 3 partition columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized: DataPageV1 428 437 13 36.8 27.2 1.0X
Parquet Vectorized: DataPageV2 427 430 3 36.8 27.2 1.0X
ORC Vectorized 543 569 25 29.0 34.5 0.8X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1171852664
cc @sunchao @dongjoon-hyun
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170953820
test micro benchmark as follows:
```
import spark.implicits._
spark.range(values).map(_ => Random.nextLong).createOrReplaceTempView("t1")
val sqlPart = (1 to pColumns).map(i => s"cast(value % ${i + 1} AS STRING) AS p${i + 1}")
.mkString(",")
val sqlText = s"SELECT $sqlPart, value AS id FROM t1"
val partitions = (1 to pColumns).map(i => s"p${i + 1}")
prepareTable(dir, spark.sql(sqlText), partitions, onlyParquetOrc = true)
val partitionsSqlPart = partitions.mkString(",")
withParquetVersions { version =>
benchmark.addCase(s"Parquet Vectorized: DataPage$version") { _ =>
spark.sql(s"select $partitionsSqlPart from parquet${version}Table").noop()
}
}
benchmark.addCase("ORC Vectorized") { _ =>
spark.sql(s"SELECT $partitionsSqlPart FROM orcTable").noop()
}
benchmark.run()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao closed pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
URL: https://github.com/apache/spark/pull/37029
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170804316
Some UTs failed, will fix later
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r910643105
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala:
##########
@@ -83,10 +80,9 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
assert(batch.numCols() === 2)
assert(batch.column(0).isInstanceOf[OrcColumnVector])
- assert(batch.column(1).isInstanceOf[OnHeapColumnVector])
+ assert(batch.column(1).isInstanceOf[ConstantColumnVector])
- val p1 = batch.column(1).asInstanceOf[OnHeapColumnVector]
- assert(isConstant.get(p1).asInstanceOf[Boolean]) // Partition column is constant.
Review Comment:
This assertion is not required when using `ConstantColumnVector`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911657914
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
for (int i = 0; i < requiredFields.length; i++) {
DataType dt = requiredFields[i].dataType();
if (requestedPartitionColIds[i] != -1) {
- OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+ ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);
Review Comment:
I see. I think it's better to do it in a separate PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911660321
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
for (int i = 0; i < requiredFields.length; i++) {
DataType dt = requiredFields[i].dataType();
if (requestedPartitionColIds[i] != -1) {
- OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+ ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);
Review Comment:
OK ~ I'll finish that if this pr merged.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911653363
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
for (int i = 0; i < requiredFields.length; i++) {
DataType dt = requiredFields[i].dataType();
if (requestedPartitionColIds[i] != -1) {
- OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+ ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);
Review Comment:
after this PR do we still need the other `populate` method on `WritableColumnVector`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] sunchao commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1171996796
Merged to master, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1172000086
thanks @sunchao
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org