Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/25 14:21:54 UTC

[GitHub] [arrow-datafusion] timvw opened a new issue, #3261: (Re-)add support for glob patterns in ListingTableUrl

timvw opened a new issue, #3261:
URL: https://github.com/apache/arrow-datafusion/issues/3261

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Since the much-needed cleanup and rationalization of ListingTableUrl in #2578, glob patterns are only supported when no scheme is provided (in practice: only on the local filesystem, and no longer on other object_stores).
   
   **Describe the solution you'd like**
   To have proper support for glob patterns, e.g. by updating the documentation (and implementation) of ListingTableUrl to the following:
   
       /// Parse a provided string as a `ListingTableUrl`
       ///
       /// # Glob File Paths
       ///
       /// If the path contains any of `'?', '*', '['`, it will be considered
       /// a glob expression and resolved as follows:
       ///
       /// The string up to the first path segment containing a glob expression will be extracted,
       /// and resolved as any other provided string.
       ///
       /// The remaining string will be interpreted as a [`glob::Pattern`] and used as a
       /// filter when listing files from object storage
       ///
       /// # Paths without a Scheme
       ///
       /// If no scheme is provided, or the string is an absolute filesystem path
       /// as determined by [`std::path::Path::is_absolute`], the string will be
       /// interpreted as a path on the local filesystem using the operating
       /// system's standard path delimiter, i.e. `\` on Windows, `/` on Unix.
       ///
       /// If you wish to specify a path that does not exist on the local
       /// machine you must provide it as a fully-qualified [file URI]
       /// e.g. `file:///myfile.txt`
       ///
       /// [file URI]: https://en.wikipedia.org/wiki/File_URI_scheme
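   
   The splitting rule described in the "Glob File Paths" section could be sketched roughly as follows. This is only an illustration using the standard library: the name `split_glob_prefix` is hypothetical and not part of DataFusion's API, and it assumes `/` is the only path delimiter (so Windows-style `\` paths are not handled).

   ```rust
   /// Hypothetical helper (not DataFusion's API): split `input` into the prefix
   /// that precedes the first glob segment and the remaining glob expression.
   /// Assumption: '/' is the only path delimiter.
   fn split_glob_prefix(input: &str) -> (&str, Option<&str>) {
       // The three characters the proposed docs treat as glob markers.
       let is_glob_char = |c: char| matches!(c, '?' | '*' | '[');

       match input.find(is_glob_char) {
           // No glob character: the whole string is an ordinary path or URL.
           None => (input, None),
           Some(idx) => match input[..idx].rfind('/') {
               // Cut just after the last '/' before the first glob character.
               Some(slash) => (&input[..=slash], Some(&input[slash + 1..])),
               // The very first segment is already a glob expression.
               None => ("", Some(input)),
           },
       }
   }
   ```

   With this sketch, `s3://bucket/path/*.parquet` splits into the prefix `s3://bucket/path/`, which would be resolved like any other string, and the pattern `*.parquet`, which would be used as a filter when listing.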
   
   **Describe alternatives you've considered**
   We could keep things as they are and push support for globbing further into user-space.
   
   Today, when a path/string contains a '*' or '[', the user is greeted with a BadSegment error anyway.
   
   @tustvold WDYT?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] timvw commented on issue #3261: (Re-)add support for glob patterns in ListingTableUrl

Posted by GitBox <gi...@apache.org>.
timvw commented on issue #3261:
URL: https://github.com/apache/arrow-datafusion/issues/3261#issuecomment-1227656150

   A couple of thoughts:
   
   * A ListingTableUrl is currently different from just a valid Url.
   ** If that were not the case, why not simply use Url?
   ** Perhaps TablePath would be a more suitable name for this concept?
   ** And the url field could be an object_store::path
   
   * The "glob" syntax seems to be well-known and universally accepted to represent a set of files
   ** https://docs.rs/glob/latest/glob/struct.Pattern.html
   ** https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#globStatus-org.apache.hadoop.fs.Path-
   ** https://sceweb.uhcl.edu/liaw/odi/ostore/doc/mo/7_unix.htm
   ** https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm
   
   Anyway, instead of making all these breaking changes without much thought, I propose to introduce a GlobbingTable which has Globs (similar to ListingTable and its ListingTableUrl) in datafusion-contrib and see how it works out...
   
   
   




[GitHub] [arrow-datafusion] tustvold commented on issue #3261: (Re-)add support for glob patterns in ListingTableUrl

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #3261:
URL: https://github.com/apache/arrow-datafusion/issues/3261#issuecomment-1227360620

   The reason I didn't do this is that glob characters aren't URL-safe, so something like `s3://bucket/path/*.parquet` isn't a valid URL. I could only find examples of systems that support glob expressions for the local filesystem, so I wasn't really sure how best to encode globs in URLs and opted to just punt on it.
   
   Some possible ideas:
   
   * Just ignore that it isn't a valid URL and accept the fact it is potentially very confusing (what this ticket proposes)
   * Provide a programmatic interface to construct a `ListingTableUrl` with a custom scheme and glob
   * Encode the glob expression as a URL-encoded query parameter
   * Something else
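   
   As a rough illustration of the query-parameter idea, the glob could be percent-encoded and appended to an otherwise valid URL. This is a sketch only: the parameter name `glob` and both function names are hypothetical, nothing here is an existing DataFusion or object_store API, and the base URL is assumed to have no query string of its own.

   ```rust
   /// Percent-encode every byte except the RFC 3986 "unreserved" characters.
   fn percent_encode(s: &str) -> String {
       let mut out = String::with_capacity(s.len());
       for b in s.bytes() {
           match b {
               b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                   out.push(b as char)
               }
               // Everything else, including '*', '?' and '[', becomes %XX.
               _ => out.push_str(&format!("%{:02X}", b)),
           }
       }
       out
   }

   /// Hypothetical: carry the glob as a `glob` query parameter on the base URL.
   /// Assumption: `base` does not already contain a query string.
   fn url_with_glob(base: &str, glob: &str) -> String {
       format!("{}?glob={}", base, percent_encode(glob))
   }
   ```

   Under these assumptions, `url_with_glob("s3://bucket/path/", "*.parquet")` yields `s3://bucket/path/?glob=%2A.parquet`, which stays a valid URL at the cost of being harder to read and type.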
   
   It is also potentially worth highlighting that IIRC the logical plan serialization currently doesn't handle glob expressions and just drops them on the floor.
   
   I think it would really help move this forward if we could find an example of a system that supports glob expressions for object stores; otherwise we end up having to design something custom, which we will inevitably get wrong.




[GitHub] [arrow-datafusion] timvw closed issue #3261: (Re-)add support for glob patterns in ListingTableUrl

Posted by GitBox <gi...@apache.org>.
timvw closed issue #3261: (Re-)add support for glob patterns in ListingTableUrl
URL: https://github.com/apache/arrow-datafusion/issues/3261

