Posted to reviews@spark.apache.org by "dongjoon-hyun (via GitHub)" <gi...@apache.org> on 2023/10/08 20:36:59 UTC

[PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

dongjoon-hyun opened a new pull request, #43283:
URL: https://github.com/apache/spark/pull/43283

   ### What changes were proposed in this pull request?
   
   This is a follow-up of #43261 to simplify path check logic and add more test coverage.
   
   ### Why are the changes needed?
   
   To make the path check logic more robust.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs with the newly added test case.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349933327


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +344,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true

Review Comment:
   Oh. You're right. Thank you. Let me fix that.





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #43283:
URL: https://github.com/apache/spark/pull/43283#issuecomment-1753337006

   Looks good to me now.




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349814837


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -30,6 +31,7 @@ import org.apache.spark._
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.HiveCatalogMetrics
 
+

Review Comment:
   ```suggestion
   ```





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #43283: [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic
URL: https://github.com/apache/spark/pull/43283




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349817037


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +345,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true
+      case None =>
+        underscore.findFirstIn(path) match {
+          case Some(dir) if dir.equals("/_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_metadata", ""))

Review Comment:
   So based on the updated logic, a dir that looks like `"/_metadata/.ab"` is filtered out, but previously it was not. Which one is more correct?
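
For what it's worth, the behavior in question can be reproduced with a standalone sketch of the method as quoted in this hunk (pre-fix regexes; this is not the actual `HadoopFSUtils` object):

```scala
import scala.util.matching.Regex

// Standalone sketch of shouldFilterOutPath as quoted in this hunk;
// not the actual org.apache.spark.util.HadoopFSUtils object.
object PathFilterSketch {
  val underscore: Regex = "/_[^=/]*/".r
  val underscoreEnd: Regex = "/_[^/]*$".r

  def shouldFilterOutPath(path: String): Boolean = {
    // Hidden entries ("/.") win over the metadata allow-list, so
    // "/_metadata/.ab" is filtered before "_metadata" is even considered.
    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
    underscoreEnd.findFirstIn(path) match {
      case Some(dir) if dir == "/_metadata" || dir == "/_common_metadata" => false
      case Some(_) => true
      case None =>
        underscore.findFirstIn(path) match {
          case Some(dir) if dir == "/_metadata/" =>
            shouldFilterOutPath(path.replaceFirst("/_metadata", ""))
          case Some(dir) if dir == "/_common_metadata/" =>
            shouldFilterOutPath(path.replaceFirst("/_common_metadata", ""))
          case Some(_) => true
          case None => false
        }
    }
  }

  def main(args: Array[String]): Unit = {
    assert(shouldFilterOutPath("/_metadata/.ab"))  // filtered by the new logic
    assert(!shouldFilterOutPath("/_metadata/1"))   // plain metadata child is kept
    println("ok")
  }
}
```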





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43283:
URL: https://github.com/apache/spark/pull/43283#issuecomment-1753566223

   Merged to master for Apache Spark 4.0.0. Thank you, @viirya .




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349819977


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +345,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true
+      case None =>
+        underscore.findFirstIn(path) match {
+          case Some(dir) if dir.equals("/_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_metadata", ""))

Review Comment:
   The new one is consistent with the existing `shouldFilterOutPathName` and correct.





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349869031


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +344,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true

Review Comment:
   I think `/_ab=123` is not filtered out by `shouldFilterOutPathName`, but it is filtered out by this `shouldFilterOutPath`?
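
The gap can be checked directly against the quoted pattern (a standalone sketch; `/_[^=/]*$` is an assumed shape of the eventual fix — a later message only says the regular expression was updated):

```scala
object RegexGapDemo {
  val underscoreEndAsQuoted = "/_[^/]*$".r  // pattern from this hunk
  val underscoreEndNoEquals = "/_[^=/]*$".r // assumed fix: exclude '=' like `underscore`

  def main(args: Array[String]): Unit = {
    // The quoted pattern treats a partition-style name as an underscored dir...
    assert(underscoreEndAsQuoted.findFirstIn("/_ab=123").contains("/_ab=123"))
    // ...whereas excluding '=' from the character class leaves it alone,
    // matching `shouldFilterOutPathName`, which keeps names containing '='.
    assert(underscoreEndNoEquals.findFirstIn("/_ab=123").isEmpty)
    // Both still catch genuinely underscored directories.
    assert(underscoreEndAsQuoted.findFirstIn("/a/_temporary").isDefined)
    assert(underscoreEndNoEquals.findFirstIn("/a/_temporary").isDefined)
    println("ok")
  }
}
```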





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349820345


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +345,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true
+      case None =>
+        underscore.findFirstIn(path) match {
+          case Some(dir) if dir.equals("/_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_metadata", ""))
+          case Some(dir) if dir.equals("/_common_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_common_metadata", ""))
+          case Some(_) => true
+          case None => false

Review Comment:
   Now we no longer depend on `shouldFilterOutPathName`, and in the caller I removed the root directory prefix to simplify this method's role.
   
   ```scala
   - }.filterNot(status => shouldFilterOutPath(status.getPath.toString))
   + }.filterNot(status => shouldFilterOutPath(status.getPath.toString.substring(prefixLength)))
   ```
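
As a minimal illustration of why the prefix is stripped (hypothetical paths; only the underscore check from the hunk above is sketched, not the full method):

```scala
object PrefixStripSketch {
  // Stand-in for just the underscore-directory check from the hunk above.
  val underscore = "/_[^=/]*/".r
  def looksFiltered(path: String): Boolean = underscore.findFirstIn(path).isDefined

  def main(args: Array[String]): Unit = {
    // Hypothetical table root whose own name starts with '_'.
    val root = "hdfs://nn/warehouse/_mytable"
    val child = s"$root/part=1/file.parquet"
    // The full path trips the check because of the root directory's own name...
    assert(looksFiltered(child))
    // ...but the portion after the root (what the caller now passes) does not.
    assert(!looksFiltered(child.substring(root.length)))
    println("ok")
  }
}
```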



##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -30,6 +31,7 @@ import org.apache.spark._
 import org.apache.spark.internal.Logging
 import org.apache.spark.metrics.source.HiveCatalogMetrics
 
+

Review Comment:
   Thank you!





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1350544362


##########
core/src/test/scala/org/apache/spark/util/HadoopFSUtilsSuite.scala:
##########
@@ -30,4 +30,36 @@ class HadoopFSUtilsSuite extends SparkFunSuite {
     assert(HadoopFSUtils.shouldFilterOutPathName("_cd_common_metadata"))
     assert(HadoopFSUtils.shouldFilterOutPathName("a._COPYING_"))
   }
+
+  test("SPARK-45452: HadoopFSUtils - path filtering") {
+    // Case 1: Regular and metadata paths
+    assert(!HadoopFSUtils.shouldFilterOutPath("/abcd"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/abcd/efg"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/year=2023/month=10/day=8/hour=13"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/part=__HIVE_DEFAULT_PARTITION__"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_cd=123"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_cd=123/1"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_metadata"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_metadata/1"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_common_metadata"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_common_metadata/1"))
+    // Case 2: Hidden paths and the paths ending `._COPYING_`
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab/cde"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab/_metadata/1"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/fg"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/_metadata"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/_common_metadata"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/year=2023/month=10/day=8/hour=13"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/x/.hidden/part=__HIVE_DEFAULT_PARTITION__"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/a._COPYING_"))
+    // Case 3: Understored paths (except metadata paths of Case 1)

Review Comment:
   Understored? Underscored?





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43283:
URL: https://github.com/apache/spark/pull/43283#issuecomment-1752162962

   Could you review this follow-up, @viirya ?




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43283:
URL: https://github.com/apache/spark/pull/43283#issuecomment-1752266746

   Thank you for the review, @viirya. The PR is updated~




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43283:
URL: https://github.com/apache/spark/pull/43283#issuecomment-1752488372

   The regular expression is updated and new test cases are added. Thank you so much, @viirya .
   ```
   [info] HadoopFSUtilsSuite:
   [info] - HadoopFSUtils - file filtering (14 milliseconds)
   [info] - SPARK-45452: HadoopFSUtils - path filtering (1 millisecond)
   [info] Run completed in 906 milliseconds.
   [info] Total number of tests run: 2
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 15 s, completed Oct 9, 2023, 12:39:34 AM
   ```




Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1349816592


##########
core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala:
##########
@@ -342,13 +345,24 @@ private[spark] object HadoopFSUtils extends Logging {
     exclude && !include
   }
 
+  private val underscore: Regex = "/_[^=/]*/".r
+  private val underscoreEnd: Regex = "/_[^/]*$".r
+
   /** Checks if we should filter out this path. */
   def shouldFilterOutPath(path: String): Boolean = {
-    val exclude = (path.contains("/_") && !path.contains("=")) || path.contains("/.")
-    val include = path.contains("/_common_metadata/") ||
-      path.endsWith("/_common_metadata") ||
-      path.contains("/_metadata/") ||
-      path.endsWith("/_metadata")
-    (exclude && !include) || shouldFilterOutPathName(path)
+    if (path.contains("/.") || path.endsWith("._COPYING_")) return true
+    underscoreEnd.findFirstIn(path) match {
+      case Some(dir) if dir.equals("/_metadata") || dir.equals("/_common_metadata") => false
+      case Some(_) => true
+      case None =>
+        underscore.findFirstIn(path) match {
+          case Some(dir) if dir.equals("/_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_metadata", ""))
+          case Some(dir) if dir.equals("/_common_metadata/") =>
+            shouldFilterOutPath(path.replaceFirst("/_common_metadata", ""))
+          case Some(_) => true
+          case None => false

Review Comment:
   Hmm, I'm not sure this simplifies the logic; it looks like it gets more complicated?





Re: [PR] [SPARK-45452][SQL][FOLLOWUP] Simplify path check logic [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43283:
URL: https://github.com/apache/spark/pull/43283#discussion_r1350719313


##########
core/src/test/scala/org/apache/spark/util/HadoopFSUtilsSuite.scala:
##########
@@ -30,4 +30,36 @@ class HadoopFSUtilsSuite extends SparkFunSuite {
     assert(HadoopFSUtils.shouldFilterOutPathName("_cd_common_metadata"))
     assert(HadoopFSUtils.shouldFilterOutPathName("a._COPYING_"))
   }
+
+  test("SPARK-45452: HadoopFSUtils - path filtering") {
+    // Case 1: Regular and metadata paths
+    assert(!HadoopFSUtils.shouldFilterOutPath("/abcd"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/abcd/efg"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/year=2023/month=10/day=8/hour=13"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/part=__HIVE_DEFAULT_PARTITION__"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_cd=123"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_cd=123/1"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_metadata"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_metadata/1"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_common_metadata"))
+    assert(!HadoopFSUtils.shouldFilterOutPath("/_common_metadata/1"))
+    // Case 2: Hidden paths and the paths ending `._COPYING_`
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab/cde"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/.ab/_metadata/1"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/fg"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/_metadata"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/_common_metadata"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/ab/.cde/year=2023/month=10/day=8/hour=13"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/x/.hidden/part=__HIVE_DEFAULT_PARTITION__"))
+    assert(HadoopFSUtils.shouldFilterOutPath("/a._COPYING_"))
+    // Case 3: Understored paths (except metadata paths of Case 1)

Review Comment:
   Thank you. Fixed.


