You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/28 11:36:36 UTC

[GitHub] [spark] eejbyfeldt opened a new pull request, #38427: [SPAR-40950

eejbyfeldt opened a new pull request, #38427:
URL: https://github.com/apache/spark/pull/38427

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   In FetchBlockRequest we currently store a `Seq[FetchBlockInfo]` as part of the function `isRemoteAddressMaxedOut` (probably other places as well, but this is the function that showd up in my profileling) we use the length of this Seq. In scala 2.12 `Seq` is an alias for `scala.collection.Seq` but in 2.13 it an alias for `scala.collection.immutable.Seq`. This means that in when for example we call `toSeq` on a `ArrayBuffer` in 2.12 we do nothing and the `blocks` in the `FetchRequest` will be backed by something with a cheap `length` but in 2.13 we end up copying the data to a `List`. 
   
   This PR solves this changing the `Seq` to and `IndexedSeq` and therefore making the expectation of a cheap length function explicit. This means that we some places will do an extra copy in scala 2.13 compared to 2.12 (was also the case before this PR). If we wanted to avoid this copy we should instead change it to use `scala.collection.IndexedSeq` so we would have the same types in both 2.13 and 2.12.  
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   The performance for ShuffleBlockFetcherIterator is much worse on Scala 2.13 than 2.12. Have seen cases were the overhead of repeatedly calculating the length is as much as 20% of cpu time (and could probably be even worse for larger shuffles).
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   No. I think the interface changes are only on private classes.
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   Existing specs. 
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296334385

   We should stick to maintaining it as immutable to set the right expectations - else future evolution of the code will assume it is fine to mutate it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300620236

   OK, I think we can't accept that much perf degradation. If there's a simple way to refactor the code to make both faster, that seems OK. Ideally we avoid separate code branches for 2.12 vs 2.13, unless it's simple and important here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008677981


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   Maybe defining a helper method and providing different implementations for Scala 2.12 and 2.13 can give consideration to both, but this may be make the code too fragmented(like `org.apache.spark.util.Iterators`, It has implementations corresponding to different Scala versions). 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296334751

   @LuciferYang, based on your results - do you think it is better to wrap it with a version specific api ? Returning Seq for both, but relying on different underlying impl for 2.12 vs 2.13 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1295854761

   > Alternate implementations feel like overkill unless this is a serious bottleneck.
   > Maybe a benchmark would help understand how big the difference is in a real workload?
   
   +1, Agree with @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296829344

   @eejbyfeldt Combined with the suggestions in Scala migration documents and @mridulm 's suggestions, if we want to solve this problem, I think we need to add a helper class and provide different implementations for different Scala versions. You can also wait to other's opinions .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1301541574

   Probably, if it's internal, and being mutable/immutable doesn't matter in the API


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1297622753

   Thank you for the thorough investigation, @LuciferYang .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300578770

   What's the current state here - this change is still much slower on 2.12?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296360816

   > @LuciferYang, based on your results - do you think it is better to wrap it with a version specific api ? Returning Seq for both, but relying on different underlying impl for 2.12 vs 2.13 ? This conversion can be done as part of the construction of `FetchBlockRequest` - so limited in scope.
   > 
   > Thoughts @srowen ?
   
   If there is no other choice, I suggest doing that
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1301540958

   @srowen Does this mean that for similar cases, if it is an internal api, we can explicitly specify `Seq` as `collection.Seq` to avoid unnecessary memory copying for Scala 2.13? There should be some other similar cases in the Spark code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1304076366

   Merged to master.
   Thanks for working on this @eejbyfeldt !
   Thanks for the reviews @srowen, @dongjoon-hyun, @LuciferYang :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008675452


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -530,7 +530,7 @@ final class ShuffleBlockFetcherIterator(
           case _ => (doBatchFetch, false)
         }
       }
-      createFetchRequests(curBlocks.toSeq, address, isLast = true, collectedRemoteRequests,
+      createFetchRequests(curBlocks.toIndexedSeq, address, isLast = true, collectedRemoteRequests,

Review Comment:
   ditto



##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -502,20 +502,20 @@ final class ShuffleBlockFetcherIterator(
         case ShuffleBlockChunkId(_, _, _, _) =>
           if (curRequestSize >= targetRemoteRequestSize ||
             curBlocks.size >= maxBlocksInFlightPerAddress) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,
               collectedRemoteRequests, enableBatchFetch = false)
             curRequestSize = curBlocks.map(_.size).sum
           }
         case ShuffleMergedBlockId(_, _, _) =>
           if (curBlocks.size >= maxBlocksInFlightPerAddress) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,

Review Comment:
   ditto



##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -502,20 +502,20 @@ final class ShuffleBlockFetcherIterator(
         case ShuffleBlockChunkId(_, _, _, _) =>
           if (curRequestSize >= targetRemoteRequestSize ||
             curBlocks.size >= maxBlocksInFlightPerAddress) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,
               collectedRemoteRequests, enableBatchFetch = false)
             curRequestSize = curBlocks.map(_.size).sum
           }
         case ShuffleMergedBlockId(_, _, _) =>
           if (curBlocks.size >= maxBlocksInFlightPerAddress) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,
               collectedRemoteRequests, enableBatchFetch = false, forMergedMetas = true)
           }
         case _ =>
           // For batch fetch, the actual block in flight should count for merged block.
           val mayExceedsMaxBlocks = !doBatchFetch && curBlocks.size >= maxBlocksInFlightPerAddress
           if (curRequestSize >= targetRemoteRequestSize || mayExceedsMaxBlocks) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,

Review Comment:
   ditto



##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -502,20 +502,20 @@ final class ShuffleBlockFetcherIterator(
         case ShuffleBlockChunkId(_, _, _, _) =>
           if (curRequestSize >= targetRemoteRequestSize ||
             curBlocks.size >= maxBlocksInFlightPerAddress) {
-            curBlocks = createFetchRequests(curBlocks.toSeq, address, isLast = false,
+            curBlocks = createFetchRequests(curBlocks.toIndexedSeq, address, isLast = false,

Review Comment:
   ditto



##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   When using Scala 2.12, `blocks.toSeq` is a redundant operation, which will not cause any memory copy . However, 'blocks.toIndexedSeq' will cause a memory copy. This change may be beneficial to Scala 2.13, but it will harm the performance of Scala 2.12. 
   
   The current default version is still Scala 2.12, so I suggest leave this line as it is and add some TODOs to facilitate change when Scala 2.13 is the default version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1295039735

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300602808

   > What's the current state here - this change is still much slower on 2.12?
   
   @srowen 
   ```
   The overall trend is consistent with local tests. Using toIndexedSeq improves the performance of Scala 2.13 by 2 to 4 times, but degrades the performance of Scala 2.12 by more than 10 times
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008821518


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -457,7 +457,7 @@ final class ShuffleBlockFetcherIterator(
   }
 
   private def createFetchRequests(
-      curBlocks: Seq[FetchBlockInfo],
+      curBlocks: IndexedSeq[FetchBlockInfo],

Review Comment:
   Since `createFetchRequests` is a private method, can we directly change the type of `curBlocks` to `Buffer or ArrayBuffer`? This allows Scala 2.12 and 2.13 to both achieve optimal performance
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] eejbyfeldt commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1303204317

   The core module specs pass on 2.13 for me as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008677981


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   Maybe defining a helper method and providing different implementations for Scala 2.12 and 2.13 can give consideration to both, but this may be too complicated. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296738349

   > @LuciferYang, based on your results - do you think it is better to wrap it with a version specific api ? Returning Seq for both, but relying on different underlying impl for 2.12 vs 2.13 ? This conversion can be done as part of the construction of `FetchBlockRequest` - so limited in scope.
   > 
   > Thoughts @srowen ?
   
   Give more information for this way: convert `ArrayBuffer` to `immutable.Seq` or `immutable.IndexedSeq`, we should use `immutable.ArraySeq.from(buffer)`  or immutable.ArraySeq.unsafeWrapArray(buffer.toArray) instead of `buffer.toSeq(buffer.toIndexedSeq)`, `from` and `unsafeWrapArray` will be more efficient.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008821518


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -457,7 +457,7 @@ final class ShuffleBlockFetcherIterator(
   }
 
   private def createFetchRequests(
-      curBlocks: Seq[FetchBlockInfo],
+      curBlocks: IndexedSeq[FetchBlockInfo],

Review Comment:
   Since `createFetchRequests` is a private method, can we directly change the type of `curBlocks` to `Buffer or ArrayBuffer`? This allows Scala 2.12 and 2.13 to achieve optimal performance
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008677981


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   Maybe defining a helper method and providing different implementations for Scala 2.12 and 2.13 can give consideration to both, but this may be too complicated(like `org.apache.spark.util.Iterators`, It has implementations corresponding to different Scala versions). 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296364776

   > We should stick to maintaining it as immutable to set the right expectations - else future evolution of the code will assume it is fine to mutate it.
   
   A little personal thought:the limitation of `immutable.Seq` is caused by Scala version migration, this comes from Scala 2.13, which changes the predefined `Seq` from `scala.collection.Seq` to ``scala.collection.immutable.Seq``,  we only promise that it is a `scala.collection.Seq` when using Scala 2.12 now and 2.13 is beta version, so if we agree to explicitly define `Seq` as `scala.collection.Seq` in the API, the things may become simple.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008675119


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   When using Scala 2.12, `blocks.toSeq` is a redundant operation, which will not cause any memory copy with Scala 2.12 . However, `blocks.toIndexedSeq` will cause a memory copy with Scala 2.12. This change may be beneficial to Scala 2.13, but it will harm the performance of Scala 2.12. 
   
   The current default version is still Scala 2.12, so I suggest leave this line as it is and add some TODOs to facilitate change when Scala 2.13 is the default version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38427:
URL: https://github.com/apache/spark/pull/38427#discussion_r1008675119


##########
core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala:
##########
@@ -276,7 +276,7 @@ final class ShuffleBlockFetcherIterator(
           val (size, mapIndex) = infoMap(blockId)
           FetchBlockInfo(BlockId(blockId), size, mapIndex)
         }
-        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toSeq)))
+        results.put(DeferFetchRequestResult(FetchRequest(address, blocks.toIndexedSeq)))

Review Comment:
   When using Scala 2.12, `blocks.toSeq` is a redundant operation, which will not cause any memory copy with Scala 2.12 . However, 'blocks.toIndexedSeq' will cause a memory copy with Scala 2.12. This change may be beneficial to Scala 2.13, but it will harm the performance of Scala 2.12. 
   
   The current default version is still Scala 2.12, so I suggest leave this line as it is and add some TODOs to facilitate change when Scala 2.13 is the default version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296126997

   In long term, given we have to move to `IndexedSeq` for 2.13 - unless there are alternatives we are yet to consider - perhaps moving to `toIndexedSeq` is fine @LuciferYang ?
   I am assuming there is no path forward where 2.13 does not need a copy (and `.length` is used enough times that we should lessen the impact of that).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296208816

   The bench result by GA as follows:
   
   **Scala 2.12**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1              1           0         96.2          10.4       1.0X
   toIndexedSeq + Size                                                      13             14           1          7.4         134.4       0.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 10 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               1              1           0         93.3          10.7       1.0X
   toIndexedSeq + Size                                                       21             21           1          4.7         212.3       0.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 100 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                1              1           0         93.3          10.7       1.0X
   toIndexedSeq + Size                                                        87             88           1          1.1         870.9       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 1              1           0         93.3          10.7       1.0X
   toIndexedSeq + Size                                                        687            689           2          0.1        6870.1       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 10000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                  1              1           0         93.3          10.7       1.0X
   toIndexedSeq + Size                                                        6738           6738           0          0.0       67378.4       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 100000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                   1              1           0         93.3          10.7       1.0X
   toIndexedSeq + Size                                                        67357          67452         135          0.0      673566.0       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 2 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 1              1           0         78.6          12.7       1.0X
   toIndexedSeq + Size                                                        692            695           2          0.1        6921.7       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 3 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 1              1           0         67.9          14.7       1.0X
   toIndexedSeq + Size                                                        695            697           2          0.1        6945.6       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 4 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 2              2           0         57.6          17.4       1.0X
   toIndexedSeq + Size                                                        698            699           1          0.1        6983.1       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 5 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 2              2           0         49.8          20.1       1.0X
   toIndexedSeq + Size                                                        702            703           1          0.1        7016.7       0.0X
   ```
   
   **Scala 2.13**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              9              9           0         11.6          86.2       1.0X
   toIndexedSeq + Size                                                      17             18           1          5.8         173.8       0.5X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 10 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              20             21           1          4.9         202.2       1.0X
   toIndexedSeq + Size                                                       20             20           0          5.1         195.9       1.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 100 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              129            132           2          0.8        1294.2       1.0X
   toIndexedSeq + Size                                                        76             77           1          1.3         763.8       1.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1240           1241           2          0.1       12397.0       1.0X
   toIndexedSeq + Size                                                        577            579           2          0.2        5769.4       2.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 10000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              12279          12285           8          0.0      122791.8       1.0X
   toIndexedSeq + Size                                                        5641           5646           8          0.0       56405.9       2.2X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 100000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              130320         130553         330          0.0     1303202.4       1.0X
   toIndexedSeq + Size                                                        59120          59237         166          0.0      591196.0       2.2X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 2 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1675           1675           1          0.1       16747.3       1.0X
   toIndexedSeq + Size                                                        614            614           1          0.2        6136.6       2.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 3 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              2100           2104           6          0.0       21000.9       1.0X
   toIndexedSeq + Size                                                        614            615           1          0.2        6143.8       3.4X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 4 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              2544           2546           3          0.0       25442.4       1.0X
   toIndexedSeq + Size                                                        616            616           0          0.2        6161.8       4.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1022-azure
   Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
   Test size of Seq with buffer size 1000 and call .size 5 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              2983           2988           8          0.0       29829.5       1.0X
   toIndexedSeq + Size                                                        617            618           1          0.2        6165.7       4.8X
   ```
   
   The overall trend is consistent with local tests. Using `toIndexedSeq` improves the performance of Scala 2.13 by 2 to 4 times, but degrades the performance of Scala 2.12 by more than 10 times


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] eejbyfeldt commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300650411

   > OK, I think we can't accept that much perf degradation. If there's a simple way to refactor the code to make both faster, that seems OK. Ideally we avoid separate code branches for 2.12 vs 2.13, unless it's simple and important here
   
   I think the two options that have been discussed are either.
   
   Separate code branches for 2.12 and 2.13 converting the mutable collections to `Seq`  for 2.12 it would just be a no-op since `Seq` is alias for `scala.collection.Seq`. For 2.13 we would copy the data to a `ArraySeq` since in 2.13 `Seq` is an alias for `scala.collection.immutable.Seq`. The gain here is I think that when we are on 2.13 we use an immutable collection instead of `scala.collection.Seq` which might point to a mutable collection.
   
   The other option would be to just change the code to explicitly use `scala.collection.Seq` (using scala.collection.IndexedSeq would also be an option) instead of `Seq` and removing the explicit calls `toSeq` then it would have the same meaning and performance as the current 2.12 code.
   
   @srowen Which approach do think is preferable? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] asfgit closed pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
asfgit closed pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13
URL: https://github.com/apache/spark/pull/38427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300679989

   The latter sounds better right? is there any downside?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1300477025

   Move forward?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] eejbyfeldt commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
eejbyfeldt commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296802435

   > > 
   > 
   > A little personal thought:the limitation of `immutable.Seq` is caused by Scala version migration, this comes from Scala 2.13, which changes the predefined of `Seq` from `scala.collection.Seq(2.12)` to `scala.collection.immutable.Seq`, we only promise that it is a `scala.collection.Seq` when using Scala 2.12 now and 2.13 is beta version, so if we agree to explicitly define `Seq` as `scala.collection.Seq` in the API, the things may become simple(can avoid calling `toSeq` or `toIndexedSeq`).
   
   The reason I changed it to `IndexedSeq` is that `mutable.ArrayBuffer` is an `IndexedSeq` on 2.12 so you can do this without copying:
   ```
   scala> val a: IndexedSeq[Int] =  scala.collection.mutable.ArrayBuffer[Int]()
   a: IndexedSeq[Int] = ArrayBuffer()
   ```
   so I assumed that `toIndexedSeq` would be a no-op like `toSeq` in 2.12. But this seems to be wrong.
   
   But I agree that if we change `Seq` to `scala.collection.Seq` or `IndexedSeq` to `scala.collection.IndexedSeq` it would be possible to just remove all the `toSeq` or `toIndexedSeq` (except one place for toIndexSeq where we currently use a ListBuffer but that could be changed to a ArrayBuffer) and get the performance on 2.13 without getting worse on 2.12.
   
   
   > Give more information for this way: convert ArrayBuffer to immutable.Seq or immutable.IndexedSeq, we should use immutable.ArraySeq.from(buffer) or immutable.ArraySeq.unsafeWrapArray(buffer.toArray) instead of buffer.toSeq( or buffer.toIndexedSeq), from and unsafeWrapArray will be more efficient.
   
   >  At the same time, there are many other similar cases. Should we fixed them all in one
   
   So the idea here would be to define some helper that can be used to convert our mutable collection to a `Seq` and  have a no-op operation on scala 2.12  and do it using an ArraySeq on 2.13. And therefore accepting an extra copy on 2.13?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1304753660

   https://github.com/apache/spark/actions/runs/3401306708/jobs/5656324784
   
   ```
   [error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.3.0! Found 5 potential problems (filtered 948)
   [error]  * method blocks()scala.collection.immutable.Seq in class org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest has a different result type in current version, where it is scala.collection.Seq rather than scala.collection.immutable.Seq
   [error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.blocks")
   [error]  * method copy(org.apache.spark.storage.BlockManagerId,scala.collection.immutable.Seq,Boolean)org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest in class org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest's type is different in current version, where it is (org.apache.spark.storage.BlockManagerId,scala.collection.Seq,Boolean)org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest instead of (org.apache.spark.storage.BlockManagerId,scala.collection.immutable.Seq,Boolean)org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest
   [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.copy")
   [error]  * synthetic method copy$default$2()scala.collection.immutable.Seq in class org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest has a different result type in current version, where it is scala.collection.Seq rather than scala.collection.immutable.Seq
   [error]    filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.copy$default$2")
   [error]  * method this(org.apache.spark.storage.BlockManagerId,scala.collection.immutable.Seq,Boolean)Unit in class org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest's type is different in current version, where it is (org.apache.spark.storage.BlockManagerId,scala.collection.Seq,Boolean)Unit instead of (org.apache.spark.storage.BlockManagerId,scala.collection.immutable.Seq,Boolean)Unit
   [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.this")
   [error]  * method apply(org.apache.spark.storage.BlockManagerId,scala.collection.immutable.Seq,Boolean)org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest in object org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest in current version does not have a correspondent with same parameter signature among (org.apache.spark.storage.BlockManagerId,scala.collection.Seq,Boolean)org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest, (java.lang.Object,java.lang.Object,java.lang.Object)java.lang.Object
   [error]    filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.storage.ShuffleBlockFetcherIterator#FetchRequest.apply")
   [error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.3.0! Found 5 potential problems
   ```
   
   Seems this pr caused the binary compatibility check failure with Scala 2.13
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1295852211

   Alternate implementations feel like overkill unless this is a serious bottleneck.
   Maybe a benchmark would help understand how big the difference is in a real workload?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296152282

   https://github.com/LuciferYang/spark/blob/size-bench/core/src/test/scala/org/apache/spark/SizeBenchmark.scala
   
   I write a simple bench to test 2 the following two scenarios:
   
   1. Different `ArrayBuffer[Int]` size + call `.size` once
   2. Same `ArrayBuffer[Int]` size + call `.size` more than once
   
   The local test results are as follows (`toSeq + Size` represents without the pr, `toIndexedSeq + Size ` epresents with the pr)
   
   **Scala 2.13**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              3              4           0         34.3          29.2       1.0X
   toIndexedSeq + Size                                                       4              4           0         25.3          39.5       0.7X
   
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 10 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              11             12           1          8.9         112.1       1.0X
   toIndexedSeq + Size                                                        5              5           0         21.3          47.0       2.4X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 100 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               92             94           4          1.1         919.3       1.0X
   toIndexedSeq + Size                                                        24             25           0          4.1         241.2       3.8X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               920            924           7          0.1        9200.1       1.0X
   toIndexedSeq + Size                                                        214            215           1          0.5        2142.2       4.3X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 10000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               9005           9218         300          0.0       90053.2       1.0X
   toIndexedSeq + Size                                                        2960           2962           2          0.0       29604.2       3.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 100000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              100712         102251        2177          0.0     1007118.0       1.0X
   toIndexedSeq + Size                                                        29682          30167         687          0.0      296816.2       3.4X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 2 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               774            780           7          0.1        7744.3       1.0X
   toIndexedSeq + Size                                                        295            296           2          0.3        2947.9       2.6X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 3 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               938            941           3          0.1        9381.0       1.0X
   toIndexedSeq + Size                                                        297            302           4          0.3        2969.4       3.2X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 4 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1102           1111          13          0.1       11020.2       1.0X
   toIndexedSeq + Size                                                        299            301           2          0.3        2992.8       3.7X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 5 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1281           1290          12          0.1       12812.8       1.0X
   toIndexedSeq + Size                                                        301            313          17          0.3        3007.9       4.3X
   ```
   
   **Scala 2.12**
   
   ```
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   --------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                              1              1           0        187.5           5.3       1.0X
   toIndexedSeq + Size                                                       4              4           0         25.0          40.0       0.1X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 10 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ---------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                               1              1           0         81.0          12.4       1.0X
   toIndexedSeq + Size                                                        5              5           0         21.6          46.3       0.3X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 100 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                1              1           0         81.1          12.3       1.0X
   toIndexedSeq + Size                                                        29             29           0          3.5         289.5       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 1              1           0         80.7          12.4       1.0X
   toIndexedSeq + Size                                                        247            249           3          0.4        2468.2       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 10000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                  1              1           0         80.7          12.4       1.0X
   toIndexedSeq + Size                                                        2432           2434           3          0.0       24318.6       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 100000 and call .size 1 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -------------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                   1              1           0         80.7          12.4       1.0X
   toIndexedSeq + Size                                                        26121          26133          17          0.0      261209.8       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 2 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 1              1           0        141.9           7.0       1.0X
   toIndexedSeq + Size                                                        263            267           4          0.4        2629.3       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 3 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 2              2           0         65.7          15.2       1.0X
   toIndexedSeq + Size                                                        264            267           2          0.4        2636.7       0.0X
   
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 4 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 2              2           0         60.7          16.5       1.0X
   toIndexedSeq + Size                                                        264            267           2          0.4        2641.9       0.0X
   
   OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Mac OS X 11.4
   Apple M1
   Test size of Seq with buffer size 1000 and call .size 5 time(s):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   -----------------------------------------------------------------------------------------------------------------------------------------------
   toSeq + Size                                                                 2              2           0         55.6          18.0       1.0X
   toIndexedSeq + Size                                                        263            266           4          0.4        2629.1       0.0X
   ```
   
   From the test results, it can be seen that change to `toIndexedSeq` can improve the performance of Scala 2.13 by 3 to 4 times( So it is necessary to change to `toIndexedSeq` in long term), but it may also cause more than 10 times of performance degradation for Scala 2.12. 
   
   The performance impact is related to the size of `ArrayBuffer` and the number of calls to `.size` methods.. 
   
   I am using GA(x86) to test this scenario and update the conclusion later.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #38427: [SPARK-40950][CORE] Fix isRemoteAddressMaxedOut performance overhead on scala 2.13

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on PR #38427:
URL: https://github.com/apache/spark/pull/38427#issuecomment-1296155612

   > In long term, given we have to move to IndexedSeq for 2.13 - unless there are alternatives we are yet to consider - perhaps moving to toIndexedSeq is fine @LuciferYang ?
   
   @mridulm Since `Seq` represents `scala.collection.Seq` in Scala 2.13, maybe we can consider using `mutable.ArraySeq` instead, which is may better than using `buffer.toIndexedSeq`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org