You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/30 20:22:16 UTC

[GitHub] [arrow-datafusion] timvw opened a new pull request, #2394: Issue 2393

timvw opened a new pull request, #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394

   # Which issue does this PR close?
   Closes #2393.
   
   # What changes are included in this PR?
   Use glob instead of manually iterating over files.
   
   # Are there any user-facing changes?
   Currently two tests fail (and not obvious to me why this happens)
       planner::test::distributed_hash_aggregate_plan
       planner::test::distributed_join_plan
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865127937


##########
dev/pr-checks.sh:
##########
@@ -0,0 +1,22 @@
+#!/bin/bash

Review Comment:
   I've renamed the script to "format-code" as that is what it's really doing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865128429


##########
dev/pr-checks.sh:
##########
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+set -e
+cargo fmt --all -- --check
+#cargo clippy --all-targets --workspace -- -D warnings

Review Comment:
   Whoops. Forgot to uncomment that line for testing each of the commands. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865386752


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   Applied the change as you suggested. Thanks for your patience with me :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r863563233


##########
data-access/src/object_store/mod.rs:
##########
@@ -91,6 +92,39 @@ pub trait ObjectStore: Sync + Send + Debug {
         })))
     }
 
+    /// Calls `list_file` with a glob_pattern
+    async fn glob_file(&self, path: &str) -> Result<FileMetaStream> {
+        const GLOB_CHARS: [char; 7] = ['{', '}', '[', ']', '*', '?', '\\'];
+
+        /// Determine whether the path contains a globbing character
+        fn is_glob_path(path: &str) -> bool {
+            path.chars().any(|c| GLOB_CHARS.contains(&c))
+        }
+
+        if !is_glob_path(path) {
+            self.list_file(path).await
+        } else {
+            // take path up to first occurence of a glob char
+            let path_to_first_glob_character: Vec<&str> =
+                path.splitn(2, |c| GLOB_CHARS.contains(&c)).collect();
+            // find last occurrence of folder /
+            let path_parts: Vec<&str> = path_to_first_glob_character[0]

Review Comment:
   I'm not intimately familiar with the Rust filesystem APIs, but relying on a '/' path separator feels like it might cause excitement on windows.
   
   Perhaps this could use https://doc.rust-lang.org/std/path/struct.Path.html#method.parent instead?



##########
data-access/src/object_store/mod.rs:
##########
@@ -80,7 +81,7 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
+        let file_stream = self.glob_file(prefix).await?;

Review Comment:
   It seems confusing if `list_file` doesn't resolve globs, but `list_file_with_suffix` does. Perhaps we could add a `glob_file_with_suffix`, to avoid changing the behavior of a pre-existing method? This could then do something like
   
   ```
   match (is_glob_path(prefix), suffix.is_empty()) {
     (false, true) => self.list_file(prefix).await,
     (false, false) => self.list_file_with_suffix(prefix, suffix).await,
     (true, _) => { // glob logic }
   }
   ```
   
   `glob_file` could then just be
   
   ```
   async fn glob_file(&self, prefix: &str) -> Result<FileMetaStream> {
     self.glob_file_with_suffix(prefix, "").await
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865123682


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =

Review Comment:
   I don't think you need to collect this iterator



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1117638809

   Fixed clippy issues + documented the checks that are executed during a PR build (+ added script to execute them).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865175287


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator

Review Comment:
   Because we want to build a path as long as possible (which means including a trailing separator). 
   
   Eg: On S3 this saves us an additional lookup 
   aws s3 ls s3://bucket/foo -> PRE foo
   aws s3 ls s3://bucket/foo/ -> files in foo..



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1117408894

   Tests pass now on my windows machine as well :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865329768


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   try_filter doesn't filter out errors, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7f679b078eeaf042970fc4ce212640e
   
   So this could be simplified to something like
   
   ```
   use futures::future::ready;
   use futures::stream::TryStreamExt;
   
   ...
   
   file_stream.try_filter(move |f| ready(f.path().end_with(&suffix)))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r863597103


##########
data-access/src/object_store/mod.rs:
##########
@@ -91,6 +92,39 @@ pub trait ObjectStore: Sync + Send + Debug {
         })))
     }
 
+    /// Calls `list_file` with a glob_pattern
+    async fn glob_file(&self, path: &str) -> Result<FileMetaStream> {

Review Comment:
   ```suggestion
       async fn glob_files(&self, path: &str) -> Result<FileMetaStream> {
   ```
   
   Should probably be plural for consistency



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865268554


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,118 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {

Review Comment:
   >  it will not pass-through the errors anymore
   
   It should do? "All errors are passed through without filtering in this combinator."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865265359


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,118 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {

Review Comment:
   Rebased to handle the merge conflict



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865164114


##########
data-access/src/object_store/local.rs:
##########
@@ -230,4 +230,30 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_globbing() -> Result<()> {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865209032


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,118 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern = Path::new(glob_pattern).components();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            let component_str_is_glob = contains_glob_start_char(component_as_str);
+            if component_str_is_glob {

Review Comment:
   ```suggestion
               if contains_glob_start_char(component_as_str) {
   ```



##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,118 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {

Review Comment:
   try_filter would perhaps be nicer https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_filter



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865135230


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {

Review Comment:
   Because I'm still learning how to write idiomatic rust :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865329768


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   try_filter doesn't filter out errors, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7f679b078eeaf042970fc4ce212640e
   
   So this could be simplified to something like
   
   ```
   use futures::future::ready;
   use futures::stream::TryStreamExt;
   
   ...
   
   file_stream.try_filter(|f| ready(f.path().end_with(&suffix)))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865263570


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,118 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+fn filter_suffix(file_stream: FileMetaStream, suffix: &str) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {

Review Comment:
   This is the code that was already in place. 
   
   When I update the code to use try_filter, it will not pass-through the errors anymore (not sure if anyone was actually doing something with them though..)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2394: Issue 2393: Support glob patterns for (local) file(s)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1114289132

   I think the code now mixes blocking IO, as performed by glob, and non-blocking IO as performed by tokio::fs. I think we should consistently use one or the other...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865329768


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   try_filter doesn't filter out errors, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=f7f679b078eeaf042970fc4ce212640e
   
   So this could be simplified to something like
   
   ```
   use futures::future::ready;
   use futures::stream::TryStreamExt;
   
   ...
   
   file_stream.try_filter(move |f| ready(f.path().end_with(&suffix)))
   ```
   
   TBC it isn't a blocker, just noticed it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865302923


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   @tustvold -> original code (above) where the errors were allowed to pass (Would be surprised to see anyone doing something meaningful with those..)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1116353125

   Afaik this should be good to go now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865164594


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator
+        if path_buf_for_longest_search_path_without_glob_pattern
+            .components()
+            .count()
+            > 1
+        {
+            result.push(path::MAIN_SEPARATOR);
+        }
+        result
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[tokio::test]
+    async fn test_is_glob_path() -> Result<()> {
+        assert!(!contains_glob_start_char("/"));
+        assert!(!contains_glob_start_char("/test"));
+        assert!(!contains_glob_start_char("/test/"));
+        assert!(contains_glob_start_char("/test*"));
+        Ok(())
+    }
+
+    fn test_longest_base_path(input: &str, expected: &str) {
+        assert_eq!(
+            find_longest_search_path_without_glob_pattern(input),
+            make_expected(input, expected),
+            "testing find_longest_search_path_without_glob_pattern with {}",
+            input
+        );
+    }
+
+    fn make_expected(input: &str, expected: &str) -> String {

Review Comment:
   Updated the code such that this updating/fixing of expected values is not needed anymore



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865127065


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator

Review Comment:
   Why is this necessary?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for (local) file(s)

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1114306230

   Had a look at the implementation of tokio fs operations, and copied their (non-public) asyncify method to run the globbing on...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865302923


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   @tustvold -> original code where the errors were allowed to pass (Would be surprised to see anyone doing something meaningful with those..)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865132094


##########
data-access/src/object_store/local.rs:
##########
@@ -230,4 +230,30 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_globbing() -> Result<()> {

Review Comment:
   Might want a test somewhere of globstar (i.e. **)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865161844


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator
+        if path_buf_for_longest_search_path_without_glob_pattern
+            .components()
+            .count()
+            > 1
+        {
+            result.push(path::MAIN_SEPARATOR);
+        }
+        result
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[tokio::test]
+    async fn test_is_glob_path() -> Result<()> {
+        assert!(!contains_glob_start_char("/"));
+        assert!(!contains_glob_start_char("/test"));
+        assert!(!contains_glob_start_char("/test/"));
+        assert!(contains_glob_start_char("/test*"));
+        Ok(())
+    }
+
+    fn test_longest_base_path(input: &str, expected: &str) {
+        assert_eq!(
+            find_longest_search_path_without_glob_pattern(input),

Review Comment:
   No. The glob patterns are os agnostic and described here: https://docs.rs/glob/latest/glob/struct.Pattern.html
   
   When we build the largest path using Path(Buf), rust prefers to use \ as path separator on windows. 
   I have updated the expected values such that they are OS-independent, eg: 
   &format!("{MAIN_SEPARATOR}a{MAIN_SEPARATOR}") instead of "/a/" (on unix) or "\a\a" (on windows)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865184541


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator

Review Comment:
   Because we need to build a path (or key) as long as possible (which allows us to minimise the amount of list operations).
   
   Eg:
   aws s3 ls s3://bucket/foo -> PRE foo
   aws s3 ls s3://bucket/foo/ -> files in foo



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1118219942

   Thanks again :+1:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for (local) file(s)

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1114120096

    Resolved failing clippy, lint and cargo fmt errors


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865114136


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(

Review Comment:
   I don't think this function needs to be async



##########
dev/pr-checks.sh:
##########
@@ -0,0 +1,22 @@
+#!/bin/bash

Review Comment:
   Perhaps we could update the CI to run this script, to prevent it from drifting?



##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator
+        if path_buf_for_longest_search_path_without_glob_pattern
+            .components()
+            .count()
+            > 1
+        {
+            result.push(path::MAIN_SEPARATOR);
+        }
+        result
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[tokio::test]
+    async fn test_is_glob_path() -> Result<()> {
+        assert!(!contains_glob_start_char("/"));
+        assert!(!contains_glob_start_char("/test"));
+        assert!(!contains_glob_start_char("/test/"));
+        assert!(contains_glob_start_char("/test*"));
+        Ok(())
+    }
+
+    fn test_longest_base_path(input: &str, expected: &str) {
+        assert_eq!(
+            find_longest_search_path_without_glob_pattern(input),

Review Comment:
   Does this mean that windows uses glob expressions with `/` as opposed to `\`?



##########
dev/pr-checks.sh:
##########
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+set -e
+cargo fmt --all -- --check
+#cargo clippy --all-targets --workspace -- -D warnings

Review Comment:
   ?



##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator
+        if path_buf_for_longest_search_path_without_glob_pattern
+            .components()
+            .count()
+            > 1
+        {
+            result.push(path::MAIN_SEPARATOR);
+        }
+        result
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[tokio::test]
+    async fn test_is_glob_path() -> Result<()> {
+        assert!(!contains_glob_start_char("/"));
+        assert!(!contains_glob_start_char("/test"));
+        assert!(!contains_glob_start_char("/test/"));
+        assert!(contains_glob_start_char("/test*"));
+        Ok(())
+    }
+
+    fn test_longest_base_path(input: &str, expected: &str) {
+        assert_eq!(
+            find_longest_search_path_without_glob_pattern(input),
+            make_expected(input, expected),
+            "testing find_longest_search_path_without_glob_pattern with {}",
+            input
+        );
+    }
+
+    fn make_expected(input: &str, expected: &str) -> String {

Review Comment:
   Some comments would help interpret what this is doing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865124978


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {

Review Comment:
   Why not just break?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1117651113

   Perhaps we could split out the formatting and CI changes into a separate PR, as I think they're someone parallel to the changes in this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865207251


##########
data-access/src/object_store/mod.rs:
##########
@@ -102,3 +132,113 @@ pub trait ObjectStore: Sync + Send + Debug {
     /// Get object reader for one file
     fn file_reader(&self, file: SizedFile) -> Result<Arc<dyn ObjectReader>>;
 }
+
+const GLOB_START_CHARS: [char; 3] = ['?', '*', '['];
+
+/// Determine whether the path contains a globbing character
+fn contains_glob_start_char(path: &str) -> bool {
+    path.chars().any(|c| GLOB_START_CHARS.contains(&c))
+}
+
+/// Filters the file_stream to only contain files that end with suffix
+async fn filter_suffix(
+    file_stream: FileMetaStream,
+    suffix: &str,
+) -> Result<FileMetaStream> {
+    let suffix = suffix.to_owned();
+    Ok(Box::pin(file_stream.filter(move |fr| {
+        let has_suffix = match fr {
+            Ok(f) => f.path().ends_with(&suffix),
+            Err(_) => true,
+        };
+        async move { has_suffix }
+    })))
+}
+
+fn find_longest_search_path_without_glob_pattern(glob_pattern: &str) -> String {
+    // in case the glob_pattern is not actually a glob pattern, take the entire thing
+    if !contains_glob_start_char(glob_pattern) {
+        glob_pattern.to_string()
+    } else {
+        // take all the components of the path (left-to-right) which do not contain a glob pattern
+        let components_in_glob_pattern =
+            Path::new(glob_pattern).components().collect::<Vec<_>>();
+        let mut path_buf_for_longest_search_path_without_glob_pattern = PathBuf::new();
+        let mut encountered_glob = false;
+        for component_in_glob_pattern in components_in_glob_pattern {
+            let component_as_str =
+                component_in_glob_pattern.as_os_str().to_str().unwrap();
+            if !encountered_glob {
+                let component_str_is_glob = contains_glob_start_char(component_as_str);
+                encountered_glob = component_str_is_glob;
+                if !encountered_glob {
+                    path_buf_for_longest_search_path_without_glob_pattern
+                        .push(component_in_glob_pattern);
+                }
+            }
+        }
+
+        let mut result = path_buf_for_longest_search_path_without_glob_pattern
+            .to_str()
+            .unwrap()
+            .to_string();
+        // when we're not at the root, append a separator

Review Comment:
   Hmm... Fair enough, this has actually made me realise something that is a touch off in the current implementation w.r.t prefixes - will raise a ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on code in PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#discussion_r865339446


##########
data-access/src/object_store/mod.rs:
##########
@@ -80,15 +83,42 @@ pub trait ObjectStore: Sync + Send + Debug {
         prefix: &str,
         suffix: &str,
     ) -> Result<FileMetaStream> {
-        let file_stream = self.list_file(prefix).await?;
-        let suffix = suffix.to_owned();
-        Ok(Box::pin(file_stream.filter(move |fr| {
-            let has_suffix = match fr {
-                Ok(f) => f.path().ends_with(&suffix),
-                Err(_) => true,

Review Comment:
   Correct. 
   
   Should have read the docs a bit better, clearly stated there: https://docs.rs/futures/0.3.1/futures/stream/trait.TryStreamExt.html#method.try_filter -> All errors are passed through without filtering in this combinator. Learning something new every day :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on pull request #2394: Issue 2393: Support glob patterns for files

Posted by GitBox <gi...@apache.org>.
timvw commented on PR #2394:
URL: https://github.com/apache/arrow-datafusion/pull/2394#issuecomment-1116182835

   Trying to find the location where to put tests to verify that globbing works:
   
   ctx.read_csv("globbedpath", CsvReadOptions::new()) -> should return data as expected
   ctx.register_csv("aggregate_test_100", "globbedpath", CsvReadOptions::new()) -> should return data as expected (when queried)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org