Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/17 03:15:33 UTC
[GitHub] [spark] LuciferYang opened a new pull request, #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`
LuciferYang opened a new pull request, #36571:
URL: https://github.com/apache/spark/pull/36571
### What changes were proposed in this pull request?
This PR adds a `putByteArrays` method to `WritableColumnVector`:
```java
public int putByteArrays(int rowId, int total, byte[] value)
```
This method supports setting multiple duplicate `byte[]` values in a `WritableColumnVector`. Because `byte[] value` has a fixed length, the required memory can be allocated once instead of calling the `reserve(int requiredCapacity)` method many times.
The new method applies to the `StringType` case and some `DecimalType` cases in `ColumnVectorUtils.populate`, which corresponds to filling partition columns in the vectorized Parquet and ORC readers.
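The idea of reserving once for a repeated value can be sketched with a toy column. This is a minimal illustration, not the PR's implementation: `ByteArrayColumnSketch`, its doubling growth policy, the `growCount` counter, and the meaning of the returned `int` (bytes written) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy byte-array column; a stand-in for illustration, NOT Spark's WritableColumnVector. */
final class ByteArrayColumnSketch {
    private byte[] data = new byte[16];  // shared buffer holding every value's bytes
    private final int[] offsets;         // start of each row's value inside `data`
    private final int[] lengths;         // length of each row's value
    private int used = 0;                // bytes of `data` written so far
    private int growCount = 0;           // number of buffer reallocations (demo only)

    ByteArrayColumnSketch(int capacity) {
        offsets = new int[capacity];
        lengths = new int[capacity];
    }

    /** Grows the buffer if needed; doubling is an assumed growth policy. */
    private void reserve(int requiredCapacity) {
        if (requiredCapacity > data.length) {
            data = Arrays.copyOf(data, Math.max(requiredCapacity, data.length * 2));
            growCount++;
        }
    }

    /** Per-row path: one capacity check (and possibly one reallocation) per value. */
    int putByteArray(int rowId, byte[] value) {
        reserve(used + value.length);
        System.arraycopy(value, 0, data, used, value.length);
        offsets[rowId] = used;
        lengths[rowId] = value.length;
        used += value.length;
        return value.length;
    }

    /** Bulk path: reserve once for `total` copies, then copy without further checks. */
    int putByteArrays(int rowId, int total, byte[] value) {
        reserve(used + total * value.length);
        for (int i = 0; i < total; i++) {
            System.arraycopy(value, 0, data, used, value.length);
            offsets[rowId + i] = used;
            lengths[rowId + i] = value.length;
            used += value.length;
        }
        return total * value.length;  // assumed meaning of the result: bytes written
    }

    byte[] getBinary(int rowId) {
        return Arrays.copyOfRange(data, offsets[rowId], offsets[rowId] + lengths[rowId]);
    }

    int growCount() {
        return growCount;
    }

    public static void main(String[] args) {
        byte[] value = "part1".getBytes(StandardCharsets.UTF_8);
        int rows = 4096;

        ByteArrayColumnSketch perRow = new ByteArrayColumnSketch(rows);
        for (int i = 0; i < rows; i++) {
            perRow.putByteArray(i, value);
        }

        ByteArrayColumnSketch bulk = new ByteArrayColumnSketch(rows);
        bulk.putByteArrays(0, rows, value);

        System.out.println("per-row reallocations: " + perRow.growCount());
        System.out.println("bulk reallocations:    " + bulk.growCount());
    }
}
```

In this sketch the bulk path reallocates the buffer a single time, while the per-row path reallocates repeatedly as the buffer doubles. How much that matters in Spark depends on the real `reserve` growth policy, which the PR description does not spell out.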
### Why are the changes needed?
Reduce the number of `reserve(int requiredCapacity)` calls, and therefore the number of memory allocations, when setting multiple duplicate fixed-length `byte[]` values in a `WritableColumnVector`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GitHub Actions (GA)
- Added a `StringType` partition column test scenario to `DataSourceReadBenchmark`.
**Before**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table(Partition Column Type = StringType):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------------
Data column - CSV                                               19487         19517         42        0.8       1239.0      1.0X
Data column - Json                                              12943         12948          7        1.2        822.9      1.5X
Data column - Parquet Vectorized: DataPageV1                      219           224          6       72.0         13.9     89.2X
Data column - Parquet Vectorized: DataPageV2                      494           501          7       31.9         31.4     39.5X
Data column - Parquet MR: DataPageV1                             2515          2521          9        6.3        159.9      7.7X
Data column - Parquet MR: DataPageV2                             2327          2337         14        6.8        148.0      8.4X
Data column - ORC Vectorized                                      303           306          2       51.9         19.3     64.3X
Data column - ORC MR                                             2126          2130          6        7.4        135.2      9.2X
Partition column - CSV                                           6480          6482          2        2.4        412.0      3.0X
Partition column - Json                                         10564         10572         11        1.5        671.6      1.8X
Partition column - Parquet Vectorized: DataPageV1                  53            58         16      296.6          3.4    367.5X
Partition column - Parquet Vectorized: DataPageV2                  52            57         10      303.6          3.3    376.1X
Partition column - Parquet MR: DataPageV1                        1231          1232          2       12.8         78.3     15.8X
Partition column - Parquet MR: DataPageV2                        1227          1229          3       12.8         78.0     15.9X
Partition column - ORC Vectorized                                  52            57          8      300.3          3.3    372.0X
Partition column - ORC MR                                        1334          1343         11       11.8         84.8     14.6X
Both columns - CSV                                              19608         19626         25        0.8       1246.6      1.0X
Both columns - Json                                             13003         13018         22        1.2        826.7      1.5X
Both columns - Parquet Vectorized: DataPageV1                     262           269          7       60.1         16.6     74.4X
Both columns - Parquet Vectorized: DataPageV2                     538           541          6       29.3         34.2     36.3X
Both columns - Parquet MR: DataPageV1                            2569          2570          2        6.1        163.3      7.6X
Both columns - Parquet MR: DataPageV2                            2343          2361         26        6.7        148.9      8.3X
Both columns - ORC Vectorized                                     344           345          1       45.7         21.9     56.6X
Both columns - ORC MR                                            2173          2178          8        7.2        138.1      9.0X
```
**After**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table(Partition Column Type = StringType):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------------
Data column - CSV                                               22546         22554         11        0.7       1433.4      1.0X
Data column - Json                                              13638         13638          0        1.2        867.1      1.7X
Data column - Parquet Vectorized: DataPageV1                      208           214          9       75.7         13.2    108.5X
Data column - Parquet Vectorized: DataPageV2                      488           492          5       32.2         31.0     46.2X
Data column - Parquet MR: DataPageV1                             2625          2631          9        6.0        166.9      8.6X
Data column - Parquet MR: DataPageV2                             2323          2328          8        6.8        147.7      9.7X
Data column - ORC Vectorized                                      296           300          6       53.1         18.8     76.1X
Data column - ORC MR                                             2154          2156          2        7.3        136.9     10.5X
Partition column - CSV                                           6410          6434         34        2.5        407.6      3.5X
Partition column - Json                                         10021         10028         10        1.6        637.1      2.2X
Partition column - Parquet Vectorized: DataPageV1                  51            55         10      306.9          3.3    439.9X
Partition column - Parquet Vectorized: DataPageV2                  51            55          9      308.1          3.2    441.6X
Partition column - Parquet MR: DataPageV1                        1207          1209          2       13.0         76.7     18.7X
Partition column - Parquet MR: DataPageV2                        1222          1237         22       12.9         77.7     18.5X
Partition column - ORC Vectorized                                  52            55          8      304.2          3.3    436.1X
Partition column - ORC MR                                        1310          1310          0       12.0         83.3     17.2X
Both columns - CSV                                              22310         22318         11        0.7       1418.4      1.0X
Both columns - Json                                             13625         13629          5        1.2        866.3      1.7X
Both columns - Parquet Vectorized: DataPageV1                     248           256         13       63.4         15.8     90.9X
Both columns - Parquet Vectorized: DataPageV2                     529           555         50       29.7         33.7     42.6X
Both columns - Parquet MR: DataPageV1                            2634          2641         10        6.0        167.5      8.6X
Both columns - Parquet MR: DataPageV2                            2375          2377          3        6.6        151.0      9.5X
Both columns - ORC Vectorized                                     338           339          1       46.5         21.5     66.6X
Both columns - ORC MR                                            2189          2193          5        7.2        139.2     10.3X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] LuciferYang closed pull request #36571: [WIP][SPARK-39202][SQL] Introduce a `putByteArrays` method for `WritableColumnVector` to support setting multiple duplicate `byte[]`
Posted by GitBox <gi...@apache.org>.
LuciferYang closed pull request #36571: [WIP][SPARK-39202][SQL] Introduce a `putByteArrays` method for `WritableColumnVector` to support setting multiple duplicate `byte[]`
URL: https://github.com/apache/spark/pull/36571
[GitHub] [spark] LuciferYang commented on a diff in pull request #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #36571:
URL: https://github.com/apache/spark/pull/36571#discussion_r874324355
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/ColumnVectorUtilsBenchmark.scala:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.commons.lang3.RandomStringUtils
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.benchmark.BenchmarkBase
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.vectorized.{ColumnVectorUtils, OffHeapColumnVector, OnHeapColumnVector}
+import org.apache.spark.sql.types.StringType
+
+
+/**
+ * Benchmark for ColumnVectorUtils.populate use OnHeapColumnVector with OffHeapColumnVector
+ * To run this benchmark:
+ * {{{
+ * 1. without sbt: bin/spark-submit --class <this class>
+ * --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ * 2. build/sbt "sql/test:runMain <this class>"
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ * Results will be written to "benchmarks/ColumnVectorUtilsBenchmark-results.txt".
+ * }}}
+ */
+object ColumnVectorUtilsBenchmark extends BenchmarkBase {
Review Comment:
Will delete this after testing.
[GitHub] [spark] LuciferYang commented on pull request #36571: [SPARK-39202][SQL] Introduce a `putByteArrays` method to `WritableColumnVector` to support setting multiple duplicate `byte[]`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #36571:
URL: https://github.com/apache/spark/pull/36571#issuecomment-1128419927
Benchmark for the `ColumnVectorUtils.populate` method:
```scala
def testPopulate(valuesPerIteration: Int, length: Int): Unit = {
  val batchSize = 4096
  val onHeapColumnVector = new OnHeapColumnVector(batchSize, StringType)
  val offHeapColumnVector = new OffHeapColumnVector(batchSize, StringType)
  val benchmark = new Benchmark(
    s"Test ColumnVectorUtils.populate, row length = $length",
    valuesPerIteration * batchSize,
    output = output)
  val builder = new UTF8StringBuilder()
  builder.append(RandomStringUtils.random(length))
  val row = InternalRow(builder.build())

  benchmark.addCase("OnHeapColumnVector") { _: Int =>
    for (_ <- 0L until valuesPerIteration) {
      onHeapColumnVector.reset()
      ColumnVectorUtils.populate(onHeapColumnVector, row, 0)
    }
  }

  benchmark.addCase("OffHeapColumnVector") { _: Int =>
    for (_ <- 0L until valuesPerIteration) {
      offHeapColumnVector.reset()
      ColumnVectorUtils.populate(offHeapColumnVector, row, 0)
    }
  }

  benchmark.run()
}

override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  val valuesPerIteration = 100000
  Seq(1, 5, 10, 15, 20).foreach { length =>
    testPopulate(valuesPerIteration, length)
  }
}
```
**Before**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 1:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          3381          3404         32      121.2          8.3      1.0X
OffHeapColumnVector                                         3931          3968         53      104.2          9.6      0.9X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 5:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4700          4767         96       87.2         11.5      1.0X
OffHeapColumnVector                                         5258          5356        139       77.9         12.8      0.9X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 10:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4920          4934         19       83.2         12.0      1.0X
OffHeapColumnVector                                         5007          5017         14       81.8         12.2      1.0X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 15:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          5227          5255         40       78.4         12.8      1.0X
OffHeapColumnVector                                         5626          5731        148       72.8         13.7      0.9X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 20:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          5226          5263         53       78.4         12.8      1.0X
OffHeapColumnVector                                         5526          5699        244       74.1         13.5      0.9X
```
**After**
```
OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 1:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          3734          3742         11      109.7          9.1      1.0X
OffHeapColumnVector                                         3683          3683          0      111.2          9.0      1.0X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 5:   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4085          4088          4      100.3         10.0      1.0X
OffHeapColumnVector                                         4770          4771          2       85.9         11.6      0.9X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 10:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4788          4789          1       85.5         11.7      1.0X
OffHeapColumnVector                                         4387          4387          0       93.4         10.7      1.1X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 15:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4669          4669          0       87.7         11.4      1.0X
OffHeapColumnVector                                         5197          5198          1       78.8         12.7      0.9X

OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1022-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test ColumnVectorUtils.populate, row length = 20:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------------
OnHeapColumnVector                                          4769          4769          0       85.9         11.6      1.0X
OffHeapColumnVector                                         5441          5441          1       75.3         13.3      0.9X
```
[GitHub] [spark] LuciferYang commented on pull request #36571: [WIP][SPARK-39202][SQL] Introduce a `putByteArrays` method for `WritableColumnVector` to support setting multiple duplicate `byte[]`
Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #36571:
URL: https://github.com/apache/spark/pull/36571#issuecomment-1129540524
It may be better to use a dictionary to store a `StringType` partition column. I'm testing this.
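As a rough illustration of the dictionary idea, a partition column holds one repeated value, so a dictionary-style layout could keep a single copy of the bytes and let every row point at it. The comment above does not say how this would be wired into Spark's column vectors, so the class `ConstantDictionaryColumnSketch` and its methods are hypothetical names used only for this sketch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical dictionary-backed column for a single repeated partition value. */
final class ConstantDictionaryColumnSketch {
    private final byte[][] dictionary;  // distinct values; only one for a partition column
    private final int[] ids;            // per-row index into `dictionary`

    ConstantDictionaryColumnSketch(int numRows, byte[] partitionValue) {
        // Store the value bytes exactly once, regardless of the row count.
        this.dictionary = new byte[][] { partitionValue.clone() };
        // A new int[] is zero-filled, so every row already refers to entry 0.
        this.ids = new int[numRows];
    }

    /** Looks up a row's value through its dictionary id; no per-row copy is made. */
    byte[] getBinary(int rowId) {
        return dictionary[ids[rowId]];
    }

    /** Bytes held for values, independent of the number of rows. */
    int storedValueBytes() {
        return dictionary[0].length;
    }

    public static void main(String[] args) {
        byte[] value = "2022-05-17".getBytes(StandardCharsets.UTF_8);
        ConstantDictionaryColumnSketch column = new ConstantDictionaryColumnSketch(4096, value);
        System.out.println(new String(column.getBinary(4095), StandardCharsets.UTF_8));
        System.out.println("value bytes stored: " + column.storedValueBytes());
    }
}
```

Compared with writing the same bytes once per row, this layout trades the repeated copies for one indirection per read; whether that is faster in practice is exactly what would need to be benchmarked.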