Posted to reviews@spark.apache.org by "dongjoon-hyun (via GitHub)" <gi...@apache.org> on 2023/10/07 07:11:20 UTC

[PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

dongjoon-hyun opened a new pull request, #43261:
URL: https://github.com/apache/spark/pull/43261

   …
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876686804

   > is a table on a directory (without any nesting or partitions) with 1M files?
   
   yea, no partition directories.




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876205574

   The listing time went from 10 seconds to more than 1 minute. I don't know the exact number of files, but it should be a lot, since listing them alone took 10 seconds.




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349584688


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -308,4 +340,14 @@ private[spark] object HadoopFSUtils extends Logging {
     val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
     exclude && !include
   }
+
+  /** Checks if we should filter out this path. */
+  def shouldFilterOutPath(path: String): Boolean = {
+    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
+    val include = path.contains("/_common_metadata/") ||
+      path.endsWith("/_common_metadata") ||
+      path.contains("/_metadata/") ||
+      path.endsWith("/_metadata")
+    (exclude && !include) || shouldFilterOutPathName(path)

Review Comment:
   Isn't `shouldFilterOutPathName` already covering these patterns?





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349586230


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1641,6 +1641,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val USE_LISTFILES_FILESYSTEM_LIST =
+    buildConf("spark.sql.sources.useListFilesFileSystemList")
+      .doc("A comma-separated list of file system schems to use FileSystem.listFiles API " +

Review Comment:
   Thanks!





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876307415

   As you can see in this PR description, listing 51,842 files took 11 seconds in my experiment.
   > For example, the improvement on a 3-year-data table with year/month/day/hour hierarchy is 158x.
   
   Given your description, I guess `100s` (the 10x result) corresponds to roughly `520k` files. In other words, did you try to list 0.5M or 1M files in a single S3 prefix? Could you double-check with the internal benchmark team, please?
   
   In addition, does your benchmark environment use a non-default S3 page size or an S3-compatible layer like DBFS? A layered file system could have side effects.
   
   Please shed some light on this for us. 😄 I'd like to take proper action on this topic because it is really helpful for most tables and users.




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349584127


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -1641,6 +1641,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val USE_LISTFILES_FILESYSTEM_LIST =
+    buildConf("spark.sql.sources.useListFilesFileSystemList")
+      .doc("A comma-separated list of file system schems to use FileSystem.listFiles API " +

Review Comment:
   schemes?





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349586346


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -308,4 +340,14 @@ private[spark] object HadoopFSUtils extends Logging {
     val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
     exclude && !include
   }
+
+  /** Checks if we should filter out this path. */
+  def shouldFilterOutPath(path: String): Boolean = {
+    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
+    val include = path.contains("/_common_metadata/") ||
+      path.endsWith("/_common_metadata") ||
+      path.contains("/_metadata/") ||
+      path.endsWith("/_metadata")
+    (exclude && !include) || shouldFilterOutPathName(path)

Review Comment:
   - `shouldFilterOutPathName` only checks the file name.
   - `shouldFilterOutPath` also checks the intermediate directories in the path, e.g. `s3://abc/_metadata/1.txt` (see the sketch below).
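
   For illustration, here is a hedged sketch (not the actual call sites in this PR) of how the two checks differ on the same file, assuming the helper bodies shown in the diff above:

   ```scala
   // Illustrative only: the name-level check sees just the leaf name, so it cannot
   // reject a file that merely sits under a hidden intermediate directory.
   HadoopFSUtils.shouldFilterOutPathName("part-00000.parquet")
   // => false: the leaf name looks like a normal data file

   // The path-level check also inspects intermediate directories.
   HadoopFSUtils.shouldFilterOutPath("s3://bucket/table/_temporary/part-00000.parquet")
   // => true: the "/_temporary" directory in the path causes the file to be filtered out
   ```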





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1877782998

   Thank you, @cloud-fan !




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1751845758

   Could you review this PR when you get a chance, @viirya and @sunchao ?




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349586427


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -150,16 +150,28 @@ object InMemoryFileIndex extends Logging {
       filter: PathFilter,
       sparkSession: SparkSession,
       parameters: Map[String, String] = Map.empty): Seq[(Path, Seq[FileStatus])] = {
-    HadoopFSUtils.parallelListLeafFiles(
-      sc = sparkSession.sparkContext,
-      paths = paths,
-      hadoopConf = hadoopConf,
-      filter = new PathFilterWrapper(filter),
-      ignoreMissingFiles =
-        new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles,
-      ignoreLocality = sparkSession.sessionState.conf.ignoreDataLocality,
-      parallelismThreshold = sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold,
-      parallelismMax = sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism)
+    val fileSystemList =
+      sparkSession.sessionState.conf.useListFilesFileSystemList.split(",").map(_.trim)
+    val ignoreMissingFiles =
+      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val useListFiles = paths.size == 1 &&

Review Comment:
   Yes, we will add that layer. Currently, this PR is designed to optimize single-table operations like `CREATE TABLE` and `REPAIR TABLE`; a usage sketch follows below.
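
   As a hedged usage sketch (the config name is taken from the diff above; the exact semantics may still change before merge), the new listing path would be enabled per file-system scheme roughly like this:

   ```scala
   // Hypothetical usage, assuming the conf added in this PR is a runtime SQL conf.
   spark.conf.set("spark.sql.sources.useListFilesFileSystemList", "s3a")

   // A single-root operation such as CREATE TABLE would then list the location with
   // one FileSystem.listFiles invocation instead of per-directory recursive listing.
   spark.sql("CREATE TABLE t USING parquet LOCATION 's3a://bucket/table/'")
   ```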





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349586277


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -69,6 +69,38 @@ private[spark] object HadoopFSUtils extends Logging {
       ignoreMissingFiles, ignoreLocality, parallelismThreshold, parallelismMax)
   }
 
+  /**
+   * Lists a collection of paths recursively with a single API invocation.
+   * Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path.
+   *
+   * This is able to be be called on both driver and executors.

Review Comment:
   Thank you. I'll fix this together with the typo above.





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1877118698

   @dongjoon-hyun  Thanks for the confirmation! I need to check with the benchmark team and verify if it's caused by extra storage layers (e.g. DBFS) not supporting this new API well.




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1864013781

   Just FYI, we hit a 10x perf regression with this change in our internal benchmark with a flattened directory layout. This PR says it can replace many recursive FileSystem calls, which is not an issue for flattened directories. Shall we disable it by default? Or should we only use it when scanning a file source table whose metadata says it has more than one partition column?




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1751930943

   Thank you, @sunchao .




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349585186


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala:
##########
@@ -150,16 +150,28 @@ object InMemoryFileIndex extends Logging {
       filter: PathFilter,
       sparkSession: SparkSession,
       parameters: Map[String, String] = Map.empty): Seq[(Path, Seq[FileStatus])] = {
-    HadoopFSUtils.parallelListLeafFiles(
-      sc = sparkSession.sparkContext,
-      paths = paths,
-      hadoopConf = hadoopConf,
-      filter = new PathFilterWrapper(filter),
-      ignoreMissingFiles =
-        new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles,
-      ignoreLocality = sparkSession.sessionState.conf.ignoreDataLocality,
-      parallelismThreshold = sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold,
-      parallelismMax = sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism)
+    val fileSystemList =
+      sparkSession.sessionState.conf.useListFilesFileSystemList.split(",").map(_.trim)
+    val ignoreMissingFiles =
+      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val useListFiles = paths.size == 1 &&

Review Comment:
   For multiple input paths, can we also call listFiles on each of them and merge the returned lists? A rough sketch of that follows below.
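
   A minimal sketch of what that could look like (illustrative only; the helper name is an assumption, and filtering/locality handling is omitted):

   ```scala
   import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
   import scala.collection.mutable.ArrayBuffer

   // Call FileSystem.listFiles once per root path and concatenate the results.
   def listAllLeafFiles(fs: FileSystem, roots: Seq[Path]): Seq[(Path, Seq[FileStatus])] = {
     roots.map { root =>
       val iter = fs.listFiles(root, true)  // recursive listing in a single API call
       val files = ArrayBuffer.empty[FileStatus]
       while (iter.hasNext) files += iter.next()
       root -> files.toSeq
     }
   }
   ```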





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43261:
URL: https://github.com/apache/spark/pull/43261#discussion_r1349584314


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -69,6 +69,38 @@ private[spark] object HadoopFSUtils extends Logging {
       ignoreMissingFiles, ignoreLocality, parallelismThreshold, parallelismMax)
   }
 
+  /**
+   * Lists a collection of paths recursively with a single API invocation.
+   * Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path.
+   *
+   * This is able to be be called on both driver and executors.

Review Comment:
   ```suggestion
      * This is able to be called on both driver and executors.
   ```





Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1751867870

   The last commit only fixes two typos in a string and a comment, and the previous commit passed all tests. Let me merge this for Apache Spark 4.0.0. Thank you again, @viirya.
   
   ![Screenshot 2023-10-07 at 4 51 17 PM](https://github.com/apache/spark/assets/9700541/46f9cb27-10e8-453a-83da-150bfc346da5)
   




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #43261: [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API
URL: https://github.com/apache/spark/pull/43261




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1751922081

   late LGTM, the improvement looks great!




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1861965398

   Have we tested it on a flattened directory layout (no partitions but many files)? Does it show a perf regression?




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876135762

   Thank you for sharing, and sorry for the late reply, @cloud-fan. I have been on vacation since December 14th.
   
   Could you be a little more specific? For example, how many files do you have in a single S3 prefix? What are the exact time values *before* and *after* when you say `10x`?
   > Just FYI, we hit a 10x perf regression with this change in our internal benchmark with a flattened directory layout. 
   
   Of course we can, @cloud-fan. Before doing that, I want to reproduce your benchmark to confirm it and to improve that case.
   > Shall we disable it by default? 




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876380033

   Not yet. I'm trying to understand your case.
   
   BTW, may I ask what you mean by `a flattened directory`? So far, I thought you meant a table on a directory (without any nesting or partitions) with 1M files. Is it some other layout?




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876366889

   > As you can see in this PR description, listing 51,842 files took 11 seconds in my experiment.
   
   Did you try with a flattened directory?




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1876991565

   To @cloud-fan, I ran the following test at 100k-file scale and do not see the **10x** regression you mentioned. Given this result, could you provide your specific procedure? Otherwise, I don't think Apache Spark has an issue.
   
   1. Prepare **100k files** in a single directory
   ```bash
   $ mkdir /tmp/100k
   
   $ for i in {1..100000}; do touch /tmp/100k/$i.txt; done
   
   $ aws s3 sync /tmp/100k s3://dongjoon/100k
   
   $ aws s3 ls s3://dongjoon/100k/ --summarize | tail -n2
   Total Objects: 100000
      Total Size: 0
   ```
   
   2. Build Apache Spark 4
   ```
   $ NO_MANUAL=1 ./dev/make-distribution.sh -Phadoop-cloud,hive
   ```
   
   3. Comparison
   
   **Apache Spark 3.5.0**
   ```
   $ bin/spark-shell \
   -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   24/01/04 04:02:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   24/01/04 04:02:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
   Spark context Web UI available at http://localhost:4041
   Spark context available as 'sc' (master = local[*], app id = local-1704369757182).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
         /_/
   
   Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.9)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
   24/01/04 04:02:45 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
   Time taken: 12310 ms
   res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
   ```
   
   **Apache Spark 4.0.0**
   ```
   $ bin/spark-shell \
   -c spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
         /_/
   
   Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.9)
   Type in expressions to have them evaluated.
   Type :help for more information.
   24/01/04 04:06:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Spark context Web UI available at http://localhost:4040
   Spark context available as 'sc' (master = local[*], app id = local-1704369986620).
   Spark session available as 'spark'.
   
   scala> spark.time(new org.apache.spark.sql.execution.datasources.InMemoryFileIndex(spark, Seq(new org.apache.hadoop.fs.Path("s3a://dongjoon/100k/")), Map.empty, None))
   24/01/04 04:06:29 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
   24/01/04 04:06:29 WARN SDKV2Upgrade: Directly referencing AWS SDK V1 credential provider com.amazonaws.auth.profile.ProfileCredentialsProvider. AWS SDK V1 credential providers will be removed once S3A is upgraded to SDK V2
   Time taken: 12857 ms
   val res0: org.apache.spark.sql.execution.datasources.InMemoryFileIndex = org.apache.spark.sql.execution.datasources.InMemoryFileIndex(s3a://dongjoon/100k)
   ```




Re: [PR] [SPARK-45452][SQL] Improve `InMemoryFileIndex` to use `FileSystem.listFiles` API [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43261:
URL: https://github.com/apache/spark/pull/43261#issuecomment-1751851658

   Thank you, @viirya !

