Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/22 02:37:44 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

HyukjinKwon opened a new pull request #31917:
URL: https://github.com/apache/spark/pull/31917


   ### What changes were proposed in this pull request?
   
   This PR updates the CSVBenchmark results, especially now that we have a fix like https://github.com/apache/spark/pull/31858 that could potentially improve the performance.
   
   ### Why are the changes needed?
   
   To have the updated benchmark results.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Manually ran the benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803925735


   **[Test build #136331 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136331/testReport)** for PR 31917 at commit [`3575e48`](https://github.com/apache/spark/commit/3575e48604b9ec97aacfdbaa9f3b2e0b48f6a788).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] wangyum commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598455353



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       It seems this is not very convenient to use, especially for new contributors.






[GitHub] [spark] MaxGekk commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803844052


   +1, LGTM, Merging this to master.




[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803787041


   @MaxGekk If we care about that, it would be great to include that in the benchmark results.




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803766091


   **[Test build #136331 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136331/testReport)** for PR 31917 at commit [`3575e48`](https://github.com/apache/spark/commit/3575e48604b9ec97aacfdbaa9f3b2e0b48f6a788).




[GitHub] [spark] SparkQA removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803730648


   **[Test build #136322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136322/testReport)** for PR 31917 at commit [`750f92b`](https://github.com/apache/spark/commit/750f92bb48d2fb84a922ff4b9a39bac55e7a57cb).




[GitHub] [spark] AmplabJenkins commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803794337


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40912/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803937741


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136331/
   




[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803846486


   I filed a JIRA for that: SPARK-34821




[GitHub] [spark] SparkQA removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803766091


   **[Test build #136331 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136331/testReport)** for PR 31917 at commit [`3575e48`](https://github.com/apache/spark/commit/3575e48604b9ec97aacfdbaa9f3b2e0b48f6a788).




[GitHub] [spark] MaxGekk commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803785118


   @HyukjinKwon The purpose is to give others enough info about the environment to get the same benchmark results. Do you really think that:
   ```
   Java HotSpot(TM) 64-Bit Server VM 11.0.3+12-LTS on Mac OS X 10.15.7
   Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
   ```
   is enough? OK, how much memory should I have? Is 1MB of RAM enough?
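
As an illustration of the point about memory, here is a minimal Python sketch of a two-line environment header that also records core count and physical RAM. It is not Spark's benchmark code: the function names `total_memory_bytes` and `environment_header` are hypothetical, and the `os.sysconf` lookup is a best-effort assumption that only works on POSIX-like systems.

```python
import os
import platform


def total_memory_bytes():
    """Best-effort physical memory size; None where sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # AttributeError: no os.sysconf (e.g. Windows); ValueError: unknown name.
        return None


def environment_header():
    """Build a header similar in spirit to the one at the top of a results file."""
    mem = total_memory_bytes()
    if mem is not None and mem > 0:
        mem_text = f"{mem / 2**30:.1f} GiB RAM"
    else:
        mem_text = "RAM size unknown"
    runtime = f"{platform.python_implementation()} {platform.python_version()}"
    os_text = f"{platform.system()} {platform.release()}"
    cpu_text = platform.processor() or platform.machine()
    return "\n".join([
        f"{runtime} on {os_text}",
        f"{cpu_text}, {os.cpu_count()} logical cores, {mem_text}",
    ])


if __name__ == "__main__":
    print(environment_header())
```

A header like this still would not make two machines comparable, but it records more of what differs between them.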




[GitHub] [spark] AmplabJenkins commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803937741


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136331/
   




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803730648


   **[Test build #136322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136322/testReport)** for PR 31917 at commit [`750f92b`](https://github.com/apache/spark/commit/750f92bb48d2fb84a922ff4b9a39bac55e7a57cb).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598442365



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       It would be great if we had a Docker image that pins the exact same versions, like the release container. That way, there'd be virtually no overhead.

##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       It would be great if we had a Docker image that pins the exact same versions, like the release script. That way, there'd be virtually no overhead.






[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803840351


   I had an offline discussion with @MaxGekk.
   
   I'm thinking about setting up a GitHub Actions workflow like "Running tests in your forked repository using GitHub Actions" (https://spark.apache.org/developer-tools.html), so that we always run the benchmarks on GA machines.
   
   I guess the machine specifications are still not guaranteed to be the same, but I would expect less variance compared to a non-pinned env, and it should be very easy for other people to run (just go to your fork, run a benchmark from the UI, and download the benchmark results). I will try to take a look, probably this week.
   
   Meanwhile, I think we can just unblock this PR and go ahead.




[GitHub] [spark] MaxGekk closed pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk closed pull request #31917:
URL: https://github.com/apache/spark/pull/31917


   




[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803792545


   @MaxGekk, we had better have a way to do that, or at least document that we should do that. All I read is:
   https://github.com/apache/spark/blob/d65f534c5ad4385b7c5198f15cb014e1d24e47c9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala#L30-L40
   
   If there are extra steps to do it, let's start another discussion and document it (I personally don't agree with this approach, though). It would be great if we had an automated script.
   




[GitHub] [spark] MaxGekk commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598438961



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       > I don't think it matters.
   
   You changed almost everything in the environment: CPU, memory, OS, JVM. Also, it seems you ran the benchmark on your laptop. For instance, this affects `Stddev`: it became bigger because other processes (or some other activity on your laptop) influenced the benchmark.
   
   One more thing: if you look at the existing benchmark results, almost all of them were produced in the same environment. In my PRs and @dongjoon-hyun's PRs, we pointed out the environment; see https://github.com/apache/spark/pull/30118, https://github.com/apache/spark/pull/28613.
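
To make the `Stddev` point concrete, here is a small Python sketch that summarizes per-iteration timings and compares a quiet run with a noisy one. The timing values are invented purely for illustration, the `summarize` helper is hypothetical, and the sample standard deviation is an assumption (the thread does not say which estimator the benchmark framework uses).

```python
import statistics


def summarize(times_ms):
    """Summarize per-iteration wall-clock times given in milliseconds."""
    avg = statistics.mean(times_ms)
    stdev = statistics.stdev(times_ms)  # sample standard deviation (assumption)
    return {
        "best": min(times_ms),
        "avg": avg,
        "stdev": stdev,
        "rel_stdev_pct": 100.0 * stdev / avg,
    }


# Hypothetical timings: same workload, with and without background activity.
quiet_run = [1000, 1010, 990, 1005, 995]
noisy_run = [1000, 1200, 990, 1350, 995]

if __name__ == "__main__":
    for name, run in (("quiet", quiet_run), ("noisy", noisy_run)):
        print(name, summarize(run))
```

In this made-up data the best time is identical in both runs, while the average and the standard deviation grow under interference, which is why a larger `Stddev` hints at a noisy environment rather than a slower code path.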






[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803752597


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803794337


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40912/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803817014


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40914/
   




[GitHub] [spark] MaxGekk commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598427096



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       @HyukjinKwon I have an older JVM ;-) BTW, can't you re-run the benchmark on the same EC2 instance?






[GitHub] [spark] AmplabJenkins commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803762552








[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803809435


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/
   




[GitHub] [spark] wangyum commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803923251


   @HyukjinKwon @MaxGekk We can use [`Hosting your own runners`](https://docs.github.com/en/actions/hosting-your-own-runners). This is an example:
   https://github.com/wangyum/spark/blob/test-ci/.github/workflows/benchmark.yml#L11
   https://github.com/wangyum/spark/runs/2164700670




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803822693


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40915/
   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598440649



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       I disagree with pinning an env to test. That will necessarily bring overhead for the dev, and discourage other contributors from keeping it updated. The purpose of the benchmark code is just to provide numbers/ratios for reference, and to make it easier for other developers to produce the numbers, etc.
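
The "numbers/ratios" argument can be sketched as follows: if every case slows down by the same factor on a different machine, the ratios relative to a baseline case stay the same even though the absolute times differ. This is an idealized Python illustration with invented numbers and a hypothetical helper `relative_to_baseline`; on real hardware the ratios can shift as well.

```python
def relative_to_baseline(avg_ms_by_case, baseline):
    """Return each case's speed relative to the baseline case (baseline = 1.0)."""
    base = avg_ms_by_case[baseline]
    return {name: base / avg_ms for name, avg_ms in avg_ms_by_case.items()}


# Hypothetical average times (ms) for the same two cases on two machines.
machine_a = {"baseline": 2000.0, "optimized": 1000.0}
machine_b = {"baseline": 3000.0, "optimized": 1500.0}

if __name__ == "__main__":
    print(relative_to_baseline(machine_a, "baseline"))
    print(relative_to_baseline(machine_b, "baseline"))
```

Here machine B is uniformly slower, yet both machines report the optimized case as twice as fast as the baseline.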






[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803802814


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40914/
   




[GitHub] [spark] HyukjinKwon edited a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803792545


   @MaxGekk, we had better have a way to do that, or at least document that we should do extra steps. All I read is:
   https://github.com/apache/spark/blob/d65f534c5ad4385b7c5198f15cb014e1d24e47c9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVBenchmark.scala#L30-L40
   
   If there are extra steps to do it, let's start another discussion and document it (FWIW I personally don't agree with having extra steps). It would be great if we had an automated script.
   
   Until we have them, I don't think it's required. I can already see that other envs were used in past benchmark results.
   




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803761914


   **[Test build #136322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136322/testReport)** for PR 31917 at commit [`750f92b`](https://github.com/apache/spark/commit/750f92bb48d2fb84a922ff4b9a39bac55e7a57cb).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803782570


   I think the benchmark results include that.




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803822675


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/
   




[GitHub] [spark] wangyum commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598449310



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       @MaxGekk Could we provide a machine to run these benchmarks?






[GitHub] [spark] HyukjinKwon commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598440649



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       I disagree with pinning an environment for testing. That would add unnecessary overhead for developers and discourage other contributors from keeping the results updated. The purpose of the benchmark code is just to provide reference numbers/ratios and to make it easier for other developers to produce those numbers, etc.






[GitHub] [spark] MaxGekk commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598452062



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       See https://github.com/apache/spark/pull/28613, it is `r3.xlarge`.






[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803817160


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40915/
   




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803750709


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40904/
   




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803790976


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/
   




[GitHub] [spark] MaxGekk commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803780677


   @HyukjinKwon Could you update the PR's description and point out the environment in which you ran the benchmark, please?




[GitHub] [spark] HyukjinKwon commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598428153



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       I don't think it matters. If we need to run it on the same machine with the same version, we should write down how to do that and require other people to follow it. That will probably need a discussion: this is non-trivial overhead, and I personally believe people won't like it very much.






[GitHub] [spark] AmplabJenkins commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803822693


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40915/
   




[GitHub] [spark] AmplabJenkins commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803817014


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40914/
   




[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803739966


   Sure, running now 👍 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803762552








[GitHub] [spark] HyukjinKwon commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803844537


   Thank you @MaxGekk!




[GitHub] [spark] SparkQA commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803794304


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40912/
   




[GitHub] [spark] MaxGekk commented on pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on pull request #31917:
URL: https://github.com/apache/spark/pull/31917#issuecomment-803790641


   @HyukjinKwon I care about reproducible benchmark results. Currently, you don't provide enough information to reproduce them. I would prefer to follow a scientific approach and have a chance to verify your results if needed.
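
As a rough illustration of the kind of environment information under discussion, the sketch below assembles a line shaped like the header in `CSVBenchmark-results.txt` ("<JVM> <version> on <OS> <OS version>") from standard JVM system properties. The class and method names (`BenchmarkEnvironment`, `jvmAndOs`, `availableCores`) are hypothetical and are not taken from Spark's benchmark code. It is also an assumption that the header corresponds to these particular properties.

```java
public final class BenchmarkEnvironment {
    private BenchmarkEnvironment() {}

    /**
     * Builds a "<JVM name> <runtime version> on <OS name> <OS version>" line.
     * Assumption: these properties approximate what a benchmark header records;
     * "java.runtime.version" is available on HotSpot/OpenJDK builds.
     */
    public static String jvmAndOs() {
        String vmName = System.getProperty("java.vm.name");
        String runtimeVersion = System.getProperty("java.runtime.version");
        String osName = System.getProperty("os.name");
        String osVersion = System.getProperty("os.version");
        return vmName + " " + runtimeVersion + " on " + osName + " " + osVersion;
    }

    /** Number of logical processors visible to this JVM. */
    public static int availableCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println(jvmAndOs());
        System.out.println("Available cores: " + availableCores());
    }
}
```

The CPU model is not exposed through standard JVM system properties, so it (and details such as the cloud instance type) would have to be recorded separately, for example in the PR description.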




[GitHub] [spark] MaxGekk commented on a change in pull request #31917: [SPARK-34815][SQL] Update CSVBenchmark

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #31917:
URL: https://github.com/apache/spark/pull/31917#discussion_r598447129



##########
File path: sql/core/benchmarks/CSVBenchmark-results.txt
##########
@@ -2,66 +2,66 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15.7

Review comment:
       IMHO, a Docker image is not enough. Hardware matters too.



