You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/03 08:45:50 UTC

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r863563233


##########
data-access/src/object_store/mod.rs:
##########
@@ -91,6 +92,39 @@ pub trait ObjectStore: Sync + Send + Debug {
         })))
     }
 
+    /// Calls `list_file` with a glob_pattern
+    async fn glob_file(&self, path: &str) -> Result<FileMetaStream> {
+        const GLOB_CHARS: [char; 7] = ['{', '}', '[', ']', '*', '?', '\\'];
+
+        /// Determine whether the path contains a globbing character
+        fn is_glob_path(path: &str) -> bool {
+            path.chars().any(|c| GLOB_CHARS.contains(&c))
+        }
+
+        if !is_glob_path(path) {
+            self.list_file(path).await
+        } else {
+            // take path up to first occurence of a glob char
+            let path_to_first_glob_character: Vec<&str> =
+                path.splitn(2, |c| GLOB_CHARS.contains(&c)).collect();
+            // find last occurrence of folder /
+            let path_parts: Vec<&str> = path_to_first_glob_character[0]

Review Comment:
   I'm not intimately familiar with the Rust filesystem APIs, but relying on a '/' path separator feels like it might cause excitement on windows.
   
   Perhaps this could use https://doc.rust-lang.org/std/path/struct.Path.html#method.parent instead?



##########
data-access/src/object_store/mod.rs:
##########
@@ -80,7 +81,7 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
+        let file_stream = self.glob_file(prefix).await?;

Review Comment:
   It seems confusing if `list_file` doesn't resolve globs, but `list_file_with_suffix` does. Perhaps we could add a `glob_file_with_suffix`, to avoid changing the behavior of a pre-existing method? This could then do something like
   
   ```
   match (is_glob_path(prefix), suffix.is_empty()) {
     (false, true) => self.list_file(prefix).await,
     (false, false) => self.list_file_with_suffix(prefix, suffix).await,
     (true, _) => { // glob logic }
   }
   ```
   
   `glob_file` could then just be
   
   ```
   async fn glob_file(&self, prefix: &str) -> Result<FileMetaStream> {
     self.glob_file_with_suffix(prefix, "").await
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org