Posted to reviews@spark.apache.org by "jiangjiguang (via GitHub)" <gi...@apache.org> on 2023/04/10 03:26:13 UTC

[GitHub] [spark] jiangjiguang opened a new pull request, #40719: [WIP]Speed up parquet reading with Java Vector API

jiangjiguang opened a new pull request, #40719:
URL: https://github.com/apache/spark/pull/40719

### What changes were proposed in this pull request?
Parquet now supports vectorized decoding through this PR: https://github.com/apache/parquet-mr/pull/1011
According to the Parquet microbenchmark, the performance gain is 4x to 8x.
With the Parquet vector optimization integrated into Apache Spark, TPC-H (SF100) Q6 shows an 11% performance improvement.

### Why are the changes needed?
This PR adds support for the Parquet vector optimization in Spark.

### Does this PR introduce _any_ user-facing change?
Yes. It adds the configuration spark.sql.parquet.vector512.read.enabled. If it is true and the CPU supports the avx512vbmi and avx512_vbmi2 instruction sets, Parquet data is decoded using the Java Vector API. On Intel CPUs, Ice Lake and newer provide the required instruction sets.
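
The hardware condition above can be checked by hand before turning the configuration on. The following is a minimal sketch, not code from this PR: it assumes a Linux host where `/proc/cpuinfo` spells the flags exactly as in the description (`avx512vbmi`, `avx512_vbmi2`), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Spark or parquet-mr.
final class Avx512FlagCheck {
  // Flag names as quoted in the PR description; assumed to match /proc/cpuinfo.
  static final List<String> REQUIRED = Arrays.asList("avx512vbmi", "avx512_vbmi2");

  /** Returns true if the first "flags" line of a cpuinfo-style text lists every required flag. */
  static boolean hasRequiredFlags(String cpuinfo) {
    for (String line : cpuinfo.split("\n")) {
      if (!line.startsWith("flags")) {
        continue;
      }
      int colon = line.indexOf(':');
      if (colon < 0) {
        continue;
      }
      String[] tokens = line.substring(colon + 1).trim().split("\\s+");
      Set<String> flags = new HashSet<>(Arrays.asList(tokens));
      return flags.containsAll(REQUIRED);
    }
    return false; // no flags line found, e.g. on a non-x86 or non-Linux host
  }

  public static void main(String[] args) throws IOException {
    Path cpuinfo = Paths.get("/proc/cpuinfo");
    boolean ok = Files.isReadable(cpuinfo)
        && hasRequiredFlags(new String(Files.readAllBytes(cpuinfo)));
    System.out.println("avx512vbmi and avx512_vbmi2 present: " + ok);
  }
}
```

If the check prints `false`, enabling spark.sql.parquet.vector512.read.enabled should have no effect according to the description above, since both conditions must hold.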

### How was this patch tested?
For the test cases, there are some problems to fix:
1. The parquet-mr community needs to publish a new Java release before the Parquet vector optimization can be used.
2. The Parquet vector optimization is not released by default, so users have to build Parquet manually with mvn clean install -P vector-plugins to get parquet-encoding-vector-{VERSION}.jar and put it on the {SPARK_HOME}/jars path.
3. GitHub does not support selecting runners with a specific instruction set, so the optimization cannot be verified on GitHub-hosted runners (a self-hosted runner could do it).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] github-actions[bot] closed pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed pull request #40719: [WIP]Speed up parquet reading with Java Vector API
URL: https://github.com/apache/spark/pull/40719



[GitHub] [spark] jiangjiguang commented on pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "jiangjiguang (via GitHub)" <gi...@apache.org>.

jiangjiguang commented on PR #40719:
URL: https://github.com/apache/spark/pull/40719#issuecomment-1501461669

   @LuciferYang @wangyum @frankliee I have added a benchmark.
   
   This is the result:
   ```
   Java HotSpot(TM) 64-Bit Server VM 17.0.5+9-LTS-191 on Linux 5.15.0-60-generic
   Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
   Selection:                                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Without Java Vector API                            4696           4802          89         21.3          47.0       1.0X
   With Java Vector API                               3742           3927         230         26.7          37.4       1.3X
   ```
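
The derived columns of the table can be recomputed from the raw ones. This small sketch (hypothetical class name, numbers copied from the table above) shows the arithmetic: the Relative column is the ratio of best times, and Rate(M/s) follows from Per Row(ns).

```java
// Hypothetical helper that only redoes the arithmetic behind the table.
final class BenchmarkRatio {
  /** Speedup of the candidate relative to the baseline, from wall-clock times. */
  static double speedup(double baselineMs, double candidateMs) {
    return baselineMs / candidateMs;
  }

  /** Throughput in millions of rows per second, from a per-row cost in nanoseconds. */
  static double rateMillionsPerSec(double perRowNs) {
    return 1000.0 / perRowNs;
  }

  public static void main(String[] args) {
    // Best times from the table: 4696 ms without, 3742 ms with the Java Vector API.
    System.out.println("speedup (best): " + speedup(4696, 3742));
    // Average times from the table: 4802 ms without, 3927 ms with.
    System.out.println("speedup (avg):  " + speedup(4802, 3927));
    System.out.println("rate without:   " + rateMillionsPerSec(47.0));
    System.out.println("rate with:      " + rateMillionsPerSec(37.4));
  }
}
```

The best-time ratio comes out at roughly 1.25, which the table rounds to 1.3X; the average-time ratio is slightly lower, at roughly 1.22.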



[GitHub] [spark] jiangjiguang commented on pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "jiangjiguang (via GitHub)" <gi...@apache.org>.

jiangjiguang commented on PR #40719:
URL: https://github.com/apache/spark/pull/40719#issuecomment-1501347591

   @LuciferYang @wangyum @frankliee Since parquet-mr has released 1.13.0, I have resubmitted the PR.



[GitHub] [spark] LuciferYang commented on a diff in pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.

LuciferYang commented on code in PR #40719:
URL: https://github.com/apache/spark/pull/40719#discussion_r1162874759


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java:
##########
@@ -928,11 +956,35 @@ private boolean readNextGroup() {
           }
           currentBufferIdx = 0;
           int valueIndex = 0;
-          while (valueIndex < this.currentCount) {
-            // values are bit packed 8 at a time, so reading bitWidth will always work
-            ByteBuffer buffer = in.slice(bitWidth);
-            this.packer.unpack8Values(buffer, buffer.position(), this.currentBuffer, valueIndex);
-            valueIndex += 8;
+          if (vector512Support && SQLConf.get().parquetVector512Read()) {

Review Comment:
   It seems there are currently no corresponding UTs. If not all GA machines can be used to check this part, how can we continue to ensure its robustness?
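
One possible direction for the testing concern, offered only as a sketch: a scalar reference unpacker that runs on any machine and whose output a vectorized unpacker could be compared against wherever the hardware is available. The class below is hypothetical and is not parquet-mr's packer. It assumes the least-significant-bit-first layout of Parquet's RLE/bit-packing hybrid encoding and takes the bit width as an explicit parameter.

```java
import java.util.Arrays;

// Hypothetical scalar reference, written for illustration only.
final class ScalarUnpackReference {
  /**
   * Unpacks 8 values of bitWidth bits each, starting at byte inPos of in,
   * reading bits least-significant first, into out[outPos .. outPos + 7].
   */
  static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
    if (bitWidth < 0 || bitWidth > 32) {
      throw new IllegalArgumentException("bitWidth out of range: " + bitWidth);
    }
    long bitPos = (long) inPos * 8;
    for (int i = 0; i < 8; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bitPos++) {
        // Select the byte holding this bit, then the bit within that byte.
        int bit = (in[(int) (bitPos >>> 3)] >>> (int) (bitPos & 7)) & 1;
        value |= bit << b;
      }
      out[outPos + i] = value;
    }
  }

  public static void main(String[] args) {
    // The values 0..7 at bit width 3 pack into these three bytes.
    byte[] packed = {(byte) 0x88, (byte) 0xC6, (byte) 0xFA};
    int[] out = new int[8];
    unpack8Values(packed, 0, out, 0, 3);
    System.out.println(Arrays.toString(out)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
  }
}
```

A unit test built this way would still exercise only the scalar path on runners without AVX-512, so it narrows the gap rather than closing it.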




[GitHub] [spark] jiangjiguang commented on pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "jiangjiguang (via GitHub)" <gi...@apache.org>.

jiangjiguang commented on PR #40719:
URL: https://github.com/apache/spark/pull/40719#issuecomment-1501350342

   @frankliee Sorry for the delay. The PR only supports AVX512; it does not support AVX256.
   Regarding your question "Do we need to create SparkContext in static code?": I do this because I want to read the SQL configuration spark.sql.parquet.vector512.read.enabled.
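
For illustration only, one alternative shape for the static-code question: do the hardware check once, and look the configuration up lazily on each call through a supplier, so that nothing has to be created in a static initializer. All names below are hypothetical and the sketch has no Spark dependency. In Spark the supplier could wrap a call such as the `SQLConf.get().parquetVector512Read()` shown in the diff under review.

```java
import java.util.function.BooleanSupplier;

// Hypothetical gate, not code from this PR.
final class Vector512Gate {
  private final boolean hardwareSupported;     // determined once, independent of any session
  private final BooleanSupplier configEnabled; // evaluated lazily on every call

  Vector512Gate(boolean hardwareSupported, BooleanSupplier configEnabled) {
    this.hardwareSupported = hardwareSupported;
    this.configEnabled = configEnabled;
  }

  /** True only when the hardware check passed and the setting is currently on. */
  boolean useVectorPath() {
    // Short-circuit: the configuration is not consulted on unsupported hardware.
    return hardwareSupported && configEnabled.getAsBoolean();
  }

  public static void main(String[] args) {
    // A JVM system property stands in for the SQL configuration here (assumption).
    Vector512Gate gate = new Vector512Gate(
        true, () -> Boolean.getBoolean("demo.vector512.read.enabled"));
    System.out.println("use vector path: " + gate.useVectorPath());
  }
}
```

Because the supplier is read on each call, a setting changed between queries would be picked up without reinitializing the reader class.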



[GitHub] [spark] github-actions[bot] commented on pull request #40719: [WIP]Speed up parquet reading with Java Vector API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #40719:
URL: https://github.com/apache/spark/pull/40719#issuecomment-1644815618

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

