You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/23 08:00:58 UTC

[GitHub] [spark] LuciferYang opened a new pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

LuciferYang opened a new pull request #35622:
URL: https://github.com/apache/spark/pull/35622


   ### What changes were proposed in this pull request?
   There are some duplicate codes in methods `catalyst.util. fileToString` and `catalyst.util. resourceToBytes`, this pr refactored the above two methods to de-duplicate codes.
   
   
   
   
   ### Why are the changes needed?
   de-duplicate codes
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GA
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #35622:
URL: https://github.com/apache/spark/pull/35622


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049534171


   Thank you, @LuciferYang . Please put your result and benchmark code into the PR description in order to make them as a commit log.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813016663



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       It has been changed to call `ByteStreams.toByteArray`, but do we still need to extract a private method? It seems that there are not many duplicate codes




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049534171


   Thank you, @LuciferYang . Please put your benchmark code and result into the PR description in order to make them as a commit log.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049874037


   Seems fine, I don't think a benchmark is important in cases like this but we have it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813392128



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -48,42 +50,23 @@ package object util extends Logging {
 
   def fileToString(file: File, encoding: Charset = UTF_8): String = {
     val inStream = new FileInputStream(file)
-    val outStream = new ByteArrayOutputStream
     try {
-      var reading = true
-      while ( reading ) {
-        inStream.read() match {
-          case -1 => reading = false
-          case c => outStream.write(c)
-        }
-      }
-      outStream.flush()
-    }
-    finally {
+      new String(ByteStreams.toByteArray(inStream), encoding)
+    } finally {
       inStream.close()
     }
-    new String(outStream.toByteArray, encoding)
+

Review comment:
       Shall we delete this empty line?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049461442


   ```scala
   
   def testToByteArray(fileSie: Int): Unit = {
   
       val bytes = RandomUtils.nextBytes(fileSie)
       val file0 = File.createTempFile(s"$fileSie-${UUID.randomUUID()}", ".dat")
       val file1 = File.createTempFile(s"$fileSie-${UUID.randomUUID()}", ".dat")
       file0.deleteOnExit()
       file1.deleteOnExit()
       FileUtils.writeByteArrayToFile(file0, bytes)
       FileUtils.writeByteArrayToFile(file1, bytes)
   
       val benchmark = new Benchmark(
         s"ToByteArray with $fileSie ",
         1,
         output = output)
   
       benchmark.addCase("toByteArray") { _: Int =>
         toByteArray(file0)
       }
   
       benchmark.addCase("toByteArray Use Guava") { _: Int =>
         toByteArrayUseGuava(file1)
       }
       benchmark.run()
     }
   
     private def toByteArrayUseGuava(file: File): Array[Byte] = {
       val inStream = new FileInputStream(file)
       try {
         ByteStreams.toByteArray(inStream)
       } finally {
         inStream.close()
       }
     }
   
     private def toByteArray(file: File): Array[Byte] = {
       val inStream = new FileInputStream(file)
       val outStream = new ByteArrayOutputStream
       try {
         var reading = true
         while (reading) {
           inStream.read() match {
             case -1 => reading = false
             case c => outStream.write(c)
           }
         }
         outStream.flush()
       } finally {
         inStream.close()
       }
       outStream.toByteArray
     }
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
       Seq(1024, 1024 * 1024, 1024 * 1024 * 10, 1024 * 1024 * 100).foreach { fileSize =>
         testToByteArray(fileSize)
       }
     }
   ```
   
   I write a micro bench as above to compare the old method and ` ByteStreams.toByteArray`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813030085



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       [9e0298e](https://github.com/apache/spark/pull/35622/commits/9e0298ece38c3ff0eeeeaffbfca6f9be9aa99bcc) change to call `ByteStreams.toByteArray`  directly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813041255



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       hmm, but we actually use this code because Spark use Guava 14.0.1:
   
   https://github.com/google/guava/blob/c3bd8eb49f9b355a140cbb56ab592e973fea4c0d/guava/src/com/google/common/io/ByteStreams.java#L200-L215
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r812993413



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def writeToByteArrayOutputStream(inStream: InputStream): ByteArrayOutputStream = {

Review comment:
       Looks like the new private method is always used with .toByteArray; can it just return the byte array?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r812984712



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,19 @@ package object util extends Logging {
     file
   }
 
+  private def writeToByteArrayOutputStream(inStream: InputStream): ByteArrayOutputStream =
+    Utils.tryWithResource(new ByteArrayOutputStream) { outStream =>

Review comment:
       [3307d28](https://github.com/apache/spark/pull/35622/commits/3307d283798010452f35e41adc1ce374e9508355) fix this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1050018778


   thanks all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813046220



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       Ah true; that implementation is still better as it copies whole buffers at a time. I think it's preferable to use, also for simplicity.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049462434


   It seems that `ByteStreams.toByteArray` is faster @dongjoon-hyun @srowen 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049535387


   > Thank you, @LuciferYang . Please put your benchmark code and result into the PR description in order to make them as a commit log.
   
   OK


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1050013197


   Merged to master. Thank you, @LuciferYang and @srowen .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049461647


   The result as follows:
   ```
   Running benchmark: ToByteArray with 1024 
     Running case: toByteArray
     Stopped after 4491 iterations, 2000 ms
     Running case: toByteArray Use Guava
     Stopped after 148518 iterations, 2000 ms
   
   OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Mac OS X 11.4
   Apple M1
   ToByteArray with 1024 :                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toByteArray                                           0              0           0          0.0      424250.0       1.0X
   toByteArray Use Guava                                 0              0           0          0.1       11416.0      37.2X
   
   Running benchmark: ToByteArray with 1048576 
     Running case: toByteArray
     Stopped after 5 iterations, 2182 ms
     Running case: toByteArray Use Guava
     Stopped after 6650 iterations, 2000 ms
   
   OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Mac OS X 11.4
   Apple M1
   ToByteArray with 1048576 :                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toByteArray                                         430            436          10          0.0   429517208.0       1.0X
   toByteArray Use Guava                                 0              0           0          0.0      273167.0    1572.4X
   
   Running benchmark: ToByteArray with 10485760 
     Running case: toByteArray
     Stopped after 2 iterations, 8609 ms
     Running case: toByteArray Use Guava
     Stopped after 542 iterations, 2001 ms
   
   OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Mac OS X 11.4
   Apple M1
   ToByteArray with 10485760 :               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toByteArray                                        4300           4305           7          0.0  4299691500.0       1.0X
   toByteArray Use Guava                                 3              4           1          0.0     3324875.0    1293.2X
   
   Running benchmark: ToByteArray with 104857600 
     Running case: toByteArray
     Stopped after 2 iterations, 86419 ms
     Running case: toByteArray Use Guava
     Stopped after 53 iterations, 2007 ms
   
   OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Mac OS X 11.4
   Apple M1
   ToByteArray with 104857600 :              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   toByteArray                                       43197          43210          18          0.0 43197043042.0       1.0X
   toByteArray Use Guava                                33             38           6          0.0    33091375.0    1305.4X
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang edited a comment on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049461442


   ```scala
   
   def testToByteArray(fileSie: Int): Unit = {
   
       val bytes = RandomUtils.nextBytes(fileSie)
       val file0 = File.createTempFile(s"$fileSie-${UUID.randomUUID()}", ".dat")
       val file1 = File.createTempFile(s"$fileSie-${UUID.randomUUID()}", ".dat")
       file0.deleteOnExit()
       file1.deleteOnExit()
       FileUtils.writeByteArrayToFile(file0, bytes)
       FileUtils.writeByteArrayToFile(file1, bytes)
   
       val benchmark = new Benchmark(
         s"ToByteArray with $fileSie ",
         1,
         output = output)
   
       benchmark.addCase("toByteArray") { _: Int =>
         toByteArray(file0)
       }
   
       benchmark.addCase("toByteArray Use Guava") { _: Int =>
         toByteArrayUseGuava(file1)
       }
       benchmark.run()
     }
   
     private def toByteArrayUseGuava(file: File): Array[Byte] = {
       val inStream = new FileInputStream(file)
       try {
         ByteStreams.toByteArray(inStream)
       } finally {
         inStream.close()
       }
     }
   
     private def toByteArray(file: File): Array[Byte] = {
       val inStream = new FileInputStream(file)
       val outStream = new ByteArrayOutputStream
       try {
         var reading = true
         while (reading) {
           inStream.read() match {
             case -1 => reading = false
             case c => outStream.write(c)
           }
         }
         outStream.flush()
       } finally {
         inStream.close()
       }
       outStream.toByteArray
     }
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
      // test 1K, 1M, 10M, 100M
       Seq(1024, 1024 * 1024, 1024 * 1024 * 10, 1024 * 1024 * 100).foreach { fileSize =>
         testToByteArray(fileSize)
       }
     }
   ```
   
   I write a micro bench as above to compare the old method and ` ByteStreams.toByteArray`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r812974782



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,19 @@ package object util extends Logging {
     file
   }
 
+  private def writeToByteArrayOutputStream(inStream: InputStream): ByteArrayOutputStream =
+    Utils.tryWithResource(new ByteArrayOutputStream) { outStream =>

Review comment:
       It is my bad, let me revert this




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049789747


   > Thank you, @LuciferYang . Please put your benchmark code and result into the PR description in order to make them as a commit log.
   
   @dongjoon-hyun done, the bench result is produced using GA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049353959


   I don't believe it was benchmarked, but copying chunks of bytes at a time rather than one at a time has to be faster - a tiny bit in the case the input stream is buffered anyway, a lot in the case that it isn't


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r812971363



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,19 @@ package object util extends Logging {
     file
   }
 
+  private def writeToByteArrayOutputStream(inStream: InputStream): ByteArrayOutputStream =
+    Utils.tryWithResource(new ByteArrayOutputStream) { outStream =>

Review comment:
       This no longer closes the inStream - I think that's good to keep




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813006627



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       OK, hah, now that I look at it - this is a pretty inefficient way to copy the byte stream. It should be read and written in chunks, not a byte at a time.
   
   That's not hard to implement, but Guava already does this:
   https://github.com/google/guava/blob/a0e2577de61a0d7e8a3dd075be66a31c93ea0446/android/guava/src/com/google/common/io/ByteStreams.java#L173
   
   In fact we use ByteStreams.toByteArray in several places. Just use that? it'll be simpler and more efficient




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang edited a comment on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang edited a comment on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049583227


   Waiting for GA to output the benchmark results
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r813055262



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def toByteArray(inStream: InputStream): Array[Byte] = {
+    val outStream = new ByteArrayOutputStream
+    try {
+      var reading = true
+      while (reading) {
+        inStream.read() match {

Review comment:
       Yes, it looks better
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a change in pull request #35622: [SPARK-38300][SQL] Refactor `fileToString` and `resourceToBytes` in catalyst.util to de-duplicate codes

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on a change in pull request #35622:
URL: https://github.com/apache/spark/pull/35622#discussion_r812998590



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
##########
@@ -100,6 +70,23 @@ package object util extends Logging {
     file
   }
 
+  private def writeToByteArrayOutputStream(inStream: InputStream): ByteArrayOutputStream = {

Review comment:
       [0fd84ad](https://github.com/apache/spark/pull/35622/commits/0fd84adee015137fead174122f1891972ba97f5a) rename this method to `toByteArray` and change return type to `Array[Byte]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049438038


   @dongjoon-hyun I remove `At the same time, the new method has better efficiency.` from pr description first and I'll investigate and give a conclusion.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on pull request #35622: [SPARK-38300][SQL] Use `ByteStreams.toByteArray` to simplify `fileToString` and `resourceToBytes` in catalyst.util

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on pull request #35622:
URL: https://github.com/apache/spark/pull/35622#issuecomment-1049583227


   Waiting for GA to output the benchmark results and upload
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org