You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/30 03:54:55 UTC

[GitHub] [spark] LuciferYang opened a new pull request, #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

LuciferYang opened a new pull request, #37029:
URL: https://github.com/apache/spark/pull/37029

   ### What changes were proposed in this pull request?
   Similar of SPARK-39231, this pr change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`.
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170955123

   after this pr
   
   ```
   ================================================================================================
   Vectorized Scan Multiple Partition Columns
   ================================================================================================
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 1 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      336            357          23         46.7          21.4       1.0X
   Parquet Vectorized: DataPageV2                      329            345          17         47.8          20.9       1.0X
   ORC Vectorized                                      419            453          49         37.5          26.6       0.8X
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 2 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      411            418           6         38.2          26.2       1.0X
   Parquet Vectorized: DataPageV2                      408            425          28         38.6          25.9       1.0X
   ORC Vectorized                                      427            432           5         36.8          27.2       1.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1031-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 3 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      432            442          14         36.4          27.5       1.0X
   Parquet Vectorized: DataPageV2                      430            435           7         36.6          27.3       1.0X
   ORC Vectorized                                      463            466           4         34.0          29.4       0.9X
   
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911655692


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
     for (int i = 0; i < requiredFields.length; i++) {
       DataType dt = requiredFields[i].dataType();
       if (requestedPartitionColIds[i] != -1) {
-        OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+        ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
         ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);

Review Comment:
   <img width="845" alt="image" src="https://user-images.githubusercontent.com/1475305/176837640-94f67389-9476-4ccb-a597-154d328f61ab.png">
   
   `ConstantColumnVectorBenchmark` and `ColumnVectorSuite` still use this method, can move it to `ConstantColumnVectorBenchmark` and remove the case in `ColumnVectorSuite`. Do all in this pr?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170954614

   before
   
   ```
   ================================================================================================
   Vectorized Scan Multiple Partition Columns
   ================================================================================================
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 1 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      411            423          19         38.3          26.1       1.0X
   Parquet Vectorized: DataPageV2                      412            416           5         38.2          26.2       1.0X
   ORC Vectorized                                      523            527           3         30.1          33.2       0.8X
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 2 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      417            422           4         37.7          26.5       1.0X
   Parquet Vectorized: DataPageV2                      418            426           8         37.7          26.5       1.0X
   ORC Vectorized                                      483            487           6         32.6          30.7       0.9X
   
   OpenJDK 64-Bit Server VM 1.8.0_332-b09 on Linux 5.13.0-1023-azure
   Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
   Vectorized Scan 3 partition columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Parquet Vectorized: DataPageV1                      428            437          13         36.8          27.2       1.0X
   Parquet Vectorized: DataPageV2                      427            430           3         36.8          27.2       1.0X
   ORC Vectorized                                      543            569          25         29.0          34.5       0.8X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1171852664

   cc @sunchao @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170953820

   test micro benchmark as follows:
   
   ```
           import spark.implicits._
           spark.range(values).map(_ => Random.nextLong).createOrReplaceTempView("t1")
   
           val sqlPart = (1 to pColumns).map(i => s"cast(value % ${i + 1} AS STRING) AS p${i + 1}")
             .mkString(",")
           val sqlText = s"SELECT $sqlPart, value AS id FROM t1"
           val partitions = (1 to pColumns).map(i => s"p${i + 1}")
           prepareTable(dir, spark.sql(sqlText), partitions, onlyParquetOrc = true)
   
           val partitionsSqlPart = partitions.mkString(",")
   
           withParquetVersions { version =>
             benchmark.addCase(s"Parquet Vectorized: DataPage$version") { _ =>
               spark.sql(s"select $partitionsSqlPart from parquet${version}Table").noop()
             }
           }
   
           benchmark.addCase("ORC Vectorized") { _ =>
             spark.sql(s"SELECT $partitionsSqlPart FROM orcTable").noop()
           }
   
           benchmark.run()
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao closed pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`
URL: https://github.com/apache/spark/pull/37029


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1170804316

   Some UTs failed, will fix later


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [WIP][SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r910643105


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala:
##########
@@ -83,10 +80,9 @@ class OrcColumnarBatchReaderSuite extends QueryTest with SharedSparkSession {
       assert(batch.numCols() === 2)
 
       assert(batch.column(0).isInstanceOf[OrcColumnVector])
-      assert(batch.column(1).isInstanceOf[OnHeapColumnVector])
+      assert(batch.column(1).isInstanceOf[ConstantColumnVector])
 
-      val p1 = batch.column(1).asInstanceOf[OnHeapColumnVector]
-      assert(isConstant.get(p1).asInstanceOf[Boolean]) // Partition column is constant.

Review Comment:
   This assertion is not required when using `ConstantColumnVector`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911657914


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
     for (int i = 0; i < requiredFields.length; i++) {
       DataType dt = requiredFields[i].dataType();
       if (requestedPartitionColIds[i] != -1) {
-        OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+        ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
         ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);

Review Comment:
   I see. I think it's better to do it in a separate PR. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911660321


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
     for (int i = 0; i < requiredFields.length; i++) {
       DataType dt = requiredFields[i].dataType();
       if (requestedPartitionColIds[i] != -1) {
-        OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+        ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
         ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);

Review Comment:
   OK ~ I'll finish that if this pr merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on a diff in pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #37029:
URL: https://github.com/apache/spark/pull/37029#discussion_r911653363


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##########
@@ -168,9 +169,8 @@ public void initBatch(
     for (int i = 0; i < requiredFields.length; i++) {
       DataType dt = requiredFields[i].dataType();
       if (requestedPartitionColIds[i] != -1) {
-        OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
+        ConstantColumnVector partitionCol = new ConstantColumnVector(capacity, dt);
         ColumnVectorUtils.populate(partitionCol, partitionValues, requestedPartitionColIds[i]);

Review Comment:
   after this PR do we still need the other `populate` method on `WritableColumnVector`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1171996796

   Merged to master, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #37029: [SPARK-39638][SQL] Change to use `ConstantColumnVector` to store partition columns in `OrcColumnarBatchReader`

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #37029:
URL: https://github.com/apache/spark/pull/37029#issuecomment-1172000086

   thanks @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org