You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/01 15:26:59 UTC
[GitHub] [spark] attilapiros edited a comment on pull request #28967: [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service

attilapiros edited a comment on pull request #28967:
URL: https://github.com/apache/spark/pull/28967#issuecomment-652485109


   # Regarding the benchmark
   
   The code is in a separate commit where both solution is tested. This benchmark is not intended to be reused just to prove this one time change is well-founded and justified. 
   
   The commit is on another [branch](https://github.com/attilapiros/spark/tree/SPARK-32149-benchmark) which based on the same as the PR. And the commit with the benchmark is [here]( https://github.com/attilapiros/spark/commit/b66ffac5c8e2678d7cbb2294cb23a14e1e1e28ad).
   
   The code is:
   ```scala
   /**
    * Benchmark for NormalizedInternedPathname.
    * To run this benchmark:
    * {{{
    *   1. without sbt:
    *      bin/spark-submit --class <this class> --jars <spark core test jar>
    *   2. build/sbt "core/test:runMain <this class>"
    *   3. generate result:
    *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
    *      Results will be written to "benchmarks/NormalizedInternedPathname-results.txt".
    * }}}
    * */
   object NormalizedInternedPathnameBenchmark extends BenchmarkBase {
     val seed = 0x1337
   
   
     private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = {
       val numLocalDir = 100
       val numSubDir = 100
       val numFilenames = 100
       val sumPathNames = numLocalDir * numSubDir * numFilenames
       val benchmark =
         new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output)
       val localDir = s"/a//b//c/d/e//f/g//$newBefore"
       val files = (1 to numLocalDir).flatMap { localDirId =>
         (1 to numSubDir).flatMap { subDirId =>
           (1 to numFilenames).map { filenameId =>
             (localDir + localDirId, subDirId.toString, s"filename_$filenameId")
           }
         }
       }
       val namedNewMethod = "new" -> normalizeNewMethod
       val namedOldMethod = "old" -> normalizeOldMethod
   
       val ((firstName, firstMethod), (secondName, secondMethod)) =
         if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod)
   
   
       benchmark.addCase(
         s"Normalize with the $firstName method", numIters) { _ =>
           firstMethod(files)
       }
       benchmark.addCase(
         s"Normalize with the $secondName method", numIters) { _ =>
           secondMethod(files)
       }
       benchmark.run()
     }
   
   
     private val normalizeOldMethod = (files: Seq[(String, String, String)]) => {
       files.map { case (localDir, subDir, filename) =>
         ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename)
       }.size
     }
   
   
     private val normalizeNewMethod = (files: Seq[(String, String, String)]) => {
       files.map { case (localDir, subDir, filename) =>
         new File(s"localDir${File.separator}subDir${File.separator}filename").getPath().intern()
       }.size
     }
   
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
       val numIters = 25
       runBenchmark("Normalize pathnames new method first") {
         normalizePathnames(numIters, newBefore = true)
       }
       runBenchmark("Normalize pathnames old method first") {
         normalizePathnames(numIters, newBefore = false)
       }
     }
   }
   ```
   
   So it runs the new and old method for a 1000000 paths for 25 iteration then do the same for 1000000 other paths but first the old method is used then the new one. The reason behind to test both method in both way (one is first the other is second) is the assumption that string interning might be different when it is first used on a string and when there is a match second time.
   
   # The benchmark result 
   
   ```
   ================================================================================================
   Normalize pathnames new method first
   ================================================================================================
   
   OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   Normalize pathnames newBefore=true:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Normalize with the new method                       252            259          10          4.0         252.2       1.0X
   Normalize with the old method                      1727           2018         162          0.6        1726.6       0.1X
   
   
   ================================================================================================
   Normalize pathnames old method first
   ================================================================================================
   
   OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   Normalize pathnames newBefore=false:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Normalize with the old method                      1812           2065         153          0.6        1812.3       1.0X
   Normalize with the new method                       252            254           2          4.0         252.0       7.2X
   
   ```
   
   So the new method is about 7-10 times better than old one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org