Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/30 18:48:10 UTC

[GitHub] [arrow-datafusion] timvw opened a new issue, #2393: Allow user to use glob/wildcard in file path

timvw opened a new issue, #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393

   Currently it is only possible to use a fully qualified path to a file or a folder, e.g.:
   * /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_01.csv
   * /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard
   
   When a folder contains several kinds of files, it would be handy to use a glob/wildcard to list only the relevant files, e.g.:
   * /Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv
   (This would allow one to exclude files such as red_01.csv in that same folder)
   
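   Currently this does not seem to work because of flawed logic in `list_all` in local.rs: it calls `tokio::fs::metadata(&prefix).await?` on the path as given, which presumably returns an error when no file or folder literally has that name.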
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1114314986

   The thing is, currently datafusion-objectstore-s3 already supports (some?) globbing...
   
   e.g. when I update a test in s3.rs to use globbing instead of a filename, it keeps working:
   
           let mut files = s3_file_system
               .list_file("data/alltypes_plain.sn*py.parquet")
               .await?;




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1115169016

   For Azure I can confirm that globbing does not work out of the box:
   
   Updated test(s) to assert that files are actually found (and processed)
   Meanwhile, extended documentation on how to create the storage containers
   https://github.com/datafusion-contrib/datafusion-objectstore-azure/pull/6
   




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1114839576

   @tustvold : Good remarks/questions!
   
   This is what I can do:
   * have a look at the tests (S3 and Azure) and verify that they fail when they should
   * verify how globbing could work on local, S3, Azure and HDFS
   




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1115466140

   Currently Spark does the following:
   Datasource (paths) -> (when __globPaths__ option is true) -> checkAndGlobPathIfNecessary
   when no glob pattern in path -> fs.listfiles(path)
   when glob pattern in path -> globber.glob(fs.listfiles(path))
   
   As @tustvold already suggested, adding a glob_files method to ObjectStore seems the appropriate way to implement this feature. This method should then:
   when no glob pattern in path -> simply list_files
   when glob pattern in path -> glob (list_files) (can implement this with file_stream.filter, similar to existing list_file_with_suffix)
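
   The two cases above can be sketched roughly as follows. This is only an illustration, not the code in the PR: `has_glob_pattern`, `glob_match` and this `glob_files` signature are hypothetical names, an in-memory slice stands in for the stream returned by the object store, and the matcher only understands `*` and `?` (no character classes such as `[345]`). Note that `*` here also matches `/`.

```rust
/// Hypothetical helper: true if the path contains a wildcard this sketch understands.
fn has_glob_pattern(path: &str) -> bool {
    path.contains(|c: char| matches!(c, '*' | '?'))
}

/// Hypothetical minimal matcher: `*` matches any run of characters
/// (including '/'), `?` matches exactly one character.
fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position to resume from when a later literal fails to match:
    // (pattern index just after the last '*', text index that '*' extends to).
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '?' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last '*' swallow one more character and retry.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be '*'.
    p[pi..].iter().all(|&c| c == '*')
}

/// Hypothetical `glob_files`: without a wildcard, behave like a plain
/// prefix listing; with a wildcard, filter the listing by the pattern.
fn glob_files(listing: &[&str], path: &str) -> Vec<String> {
    listing
        .iter()
        .copied()
        .filter(|file| {
            if has_glob_pattern(path) {
                glob_match(path, file)
            } else {
                file.starts_with(path)
            }
        })
        .map(String::from)
        .collect()
}
```

   With a listing of `green_01.csv`, `green_02.csv` and `red_01.csv` in one folder, a `green_*.csv` pattern keeps only the two green files, while a plain folder path returns all three.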
   
   Apart from my use-case, the existing list_file_with_suffix is further evidence that there is a need for (simple) globbing.
   
   Will rework my code in https://github.com/apache/arrow-datafusion/pull/2394 to conform with the above.
   




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1114308038

   Your suggestion to make this explicit on ObjectStore makes a lot of sense (was not aware of the other implementations and the datafusion-contrib project till now).




[GitHub] [arrow-datafusion] tustvold commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1114294530

   I'm not sure how I feel about having the local object store treat paths differently from the other implementations. Perhaps we should consistently support glob expressions as part of the ObjectStore trait, or not? Having a mix seems unfortunate...
   
   FYI @matthewmturner @alamb 




[GitHub] [arrow-datafusion] alamb commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1115177722

   timvw -- I wonder if you could do the "globbing" at a higher level (i.e. in whatever interface the user types `/Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv` into)?
   
   So rather than passing `/Users/timvw/src/github/arrow-datafusion/testing/data/wildcard/green_*.csv` to datafusion directly, you could implement something that uses the object store's `list_all` feature to implement whatever globbing semantics you want, and then pass the fully resolved list of files to datafusion?
   
   I am not sure if this would work for your use case.




[GitHub] [arrow-datafusion] tustvold closed issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #2393: Allow user to use glob/wildcard in file path
URL: https://github.com/apache/arrow-datafusion/issues/2393




[GitHub] [arrow-datafusion] tustvold commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1114676891

   Does this work correctly with multiple files, or wildcards in the path and not the file? I'm very surprised to see this working.
   
   I ask as it is just calling out to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html which doesn't support wildcards, let alone glob expressions. S3 doesn't even have a defined concept of a directory, so globbing has to be implemented client side.
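
   To illustrate what "client side" could mean here: a listing call that only accepts a literal prefix would need the glob path split into the part before the first wildcard, which is used for listing, and the rest, which is matched against the returned keys. A rough sketch, where `split_glob_prefix` is a hypothetical helper and not part of any existing ObjectStore implementation:

```rust
/// Hypothetical helper: splits a glob path into
/// (literal prefix usable for listing, remaining pattern to match client side).
fn split_glob_prefix(path: &str) -> (&str, &str) {
    match path.find(|c: char| matches!(c, '*' | '?' | '[')) {
        // No wildcard at all: the whole path is a literal prefix.
        None => (path, ""),
        Some(first_glob) => {
            // Cut after the last '/' before the first wildcard, so that the
            // prefix always names a whole "folder" (or is empty).
            let cut = path[..first_glob].rfind('/').map_or(0, |i| i + 1);
            path.split_at(cut)
        }
    }
}
```

   For `data/wildcard/green_*.csv` this yields the prefix `data/wildcard/` and the pattern `green_*.csv`. Cutting at a `/` is a choice made so the same prefix would also be meaningful for a local file system.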
   
   




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1115526997

   With the current code my use-case works:
   
   ```rust
       use datafusion::prelude::*;

       let ctx = SessionContext::new();
       let nycdata = "/Users/timvw/nyc/trip data";
       // The character class [345] selects the files for months 03, 04 and 05 of 2018.
       let yellow = format!("{}/yellow_tripdata_2018-0[345].csv", nycdata);
       let options = CsvReadOptions::new();
       let df = ctx.read_csv(yellow, options).await?;
   ```




[GitHub] [arrow-datafusion] timvw commented on issue #2393: Allow user to use glob/wildcard in file path

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #2393:
URL: https://github.com/apache/arrow-datafusion/issues/2393#issuecomment-1115180345

   @alamb That seems indeed to be the way forward. Will also have a look at hadoop fs to find some inspiration on how they implemented that (across filesystem implementations)

