
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/04 19:48:22 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue, #2445: ObjectStore Directory Semantics

tustvold opened a new issue, #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   `LocalFileSystem` interprets the prefix passed to `ObjectStore::list_file` as the path to a directory, and then proceeds to enumerate this directory recursively. `S3FileSystem`, however, interprets the prefix as a string prefix.
   
   The distinction arises if you consider a file structure like
   
   ```
   foo/a.txt
   foo/b.txt
   ```
   
   If called with a prefix of `fo`, `LocalFileSystem` will return an error, whereas `S3FileSystem` will return both files.
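
The difference can be sketched over an in-memory list of keys. This is only an illustration of the two interpretations, not the code of either implementation; `list_by_string_prefix` and `list_by_directory` are hypothetical names.

```rust
/// String-prefix listing: every key that starts with `prefix` matches.
fn list_by_string_prefix<'a>(keys: &[&'a str], prefix: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| k.starts_with(prefix)).collect()
}

/// Directory-style listing: `prefix` must name a directory, i.e. at least one
/// key must live under `prefix/`. Anything else is reported as an error.
fn list_by_directory<'a>(keys: &[&'a str], prefix: &str) -> Result<Vec<&'a str>, String> {
    let dir = format!("{}/", prefix.trim_end_matches('/'));
    let found: Vec<&'a str> = keys
        .iter()
        .copied()
        .filter(|k| k.starts_with(dir.as_str()))
        .collect();
    if found.is_empty() {
        Err(format!("no such directory: {}", prefix))
    } else {
        Ok(found)
    }
}
```

With the keys `foo/a.txt` and `foo/b.txt`, a prefix of `fo` yields both keys from the first function and an error from the second, while a prefix of `foo` yields both keys from either.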
   
   **Describe the solution you'd like**
   
   I personally would expect something called `ObjectStore` to behave like an object store, and not a filesystem. In particular I would expect it to behave like a KV store without any notion of directories.
   
   I would therefore suggest:
   
   * Remove FileSystem from the naming of the implementations
   * Map an object store to filesystem semantics, as opposed to mapping filesystem semantics to object storage
   
   **Describe alternatives you've considered**
   
   We could instead call the trait something like `FileSystem` and give it filesystem-like semantics.
   
   **Additional context**
   
   I noticed this whilst reviewing https://github.com/apache/arrow-datafusion/pull/2394 - it seems off to me that we should need to split based on path delimiters given object stores don't have such a concept.
   
   FYI @matthewmturner @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119938026

   I do agree that the capabilities we actually need are rather limited (compared to a full filesystem spec) and it makes sense to not name those FileSystem then. Should we also define what we expect in terms of ACID properties?
   
   
   @alamb The globbing is mainly relevant in raw/ingestion folders... 
   
   E.g., we end up with a structure such as:
   /nyc-taxidata/input/yellow_tripdata_2021-11.csv
   /nyc-taxidata/input/yellow_tripdata_2021-12.csv
   /nyc-taxidata/input/yellow_tripdata_2022-01.csv
   /nyc-taxidata/input/green_tripdata_2021-12.csv
   /nyc-taxidata/input/green_tripdata_2022-01.csv
   /nyc-taxidata/input/green_tripdata_2022-02.csv
   
   In a typical job we would then process and prepare the data for consumption:
   /nyc-taxidata/accepted/yellow_tripdata/year=2022/month=1/blah.parquet
   /nyc-taxidata/accepted/green_tripdata/year=2022/month=1/blah.parquet
   
   I don't need access to all sorts of key filters (compared to all the key filters in a system such as [HBase](https://hbase.apache.org/2.3/apidocs/index.html)), but globbing is not something I would push back to the end user. (In Hadoop this is also supported by the alternative (S3, Azure) Hadoop filesystem implementations.)
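
One way to read this: a glob can be layered on top of prefix listing by pushing the literal part of the pattern down as the prefix and matching the rest on the client. The sketch below assumes an in-memory key list standing in for the store, supports only `*`, and uses hypothetical names (`wildcard_match`, `literal_prefix`, `list_glob`); it is not DataFusion's implementation.

```rust
/// Match `text` against `pattern`, where `*` matches any run of bytes
/// (including `/`). A deliberately tiny matcher, not a full glob engine.
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0usize, 0usize);
    // (pattern index just after the last `*`, text index it currently covers up to)
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last `*` swallow one more byte and retry from there.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    // Only trailing `*`s may remain unmatched in the pattern.
    p[pi..].iter().all(|&b| b == b'*')
}

/// The literal text before the first `*`; this is what a prefix-only
/// list call can use directly.
fn literal_prefix(glob: &str) -> &str {
    match glob.find('*') {
        Some(i) => &glob[..i],
        None => glob,
    }
}

/// List by literal prefix, then apply the full glob on the client side.
fn list_glob<'a>(keys: &[&'a str], glob: &str) -> Vec<&'a str> {
    let prefix = literal_prefix(glob);
    keys.iter()
        .copied()
        .filter(|k| k.starts_with(prefix)) // stands in for a prefix list call
        .filter(|k| wildcard_match(glob, k)) // client-side pattern match
        .collect()
}
```

For the input folder above, `/nyc-taxidata/input/yellow_tripdata_2021-*.csv` narrows the listing to a long literal prefix, whereas `/nyc-taxidata/input/*_tripdata_2022-01.csv` can only push down the folder and filters everything else locally.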
   
   
   



[GitHub] [arrow-datafusion] tustvold commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119804996

   If we like the IOx object store interface and want to reuse the implementation, I can probably see about getting it published to crates.io, just let me know. It wasn't my intent with this issue, rather I just wanted clarity on what I should be reviewing :sweat_smile:, but I would be happy to help make it happen if there is consensus on it being a good idea



[GitHub] [arrow-datafusion] tustvold closed issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

tustvold closed issue #2445: ObjectStore Directory Semantics
URL: https://github.com/apache/arrow-datafusion/issues/2445



[GitHub] [arrow-datafusion] tustvold commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119739041

   > Are there examples of ObjectStore implementations
   
   I'm not sure what you mean by this, but object stores are really just key value stores with a vaguely RESTful API, i.e.
   
   * PutObject - associate an object (set of bytes) with a string key, replacing any existing value
   * GetObject - get the object associated with a key
   * CopyObject - copy the object associated with one key, to another
   * ListObjects - list the keys with a given prefix
   * DeleteObject - delete the value with a given key
   
   There are more complex APIs for things like multipart uploads, bucket creation, etc... but in terms of what a client would be interested in that is the entirety of the API. To put it another way, **the interface of object storage is significantly less expressive than that of a filesystem**.
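
To make the shape of that interface concrete, here is a minimal in-memory sketch of the five operations listed above. `InMemoryStore` and its method names are hypothetical and chosen for illustration; real object stores add error handling, metadata, pagination and so on.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory object store: a flat map from string keys to bytes.
/// There is no notion of a directory anywhere in the interface.
#[derive(Default)]
struct InMemoryStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl InMemoryStore {
    /// PutObject: associate bytes with a key, replacing any existing value.
    fn put(&mut self, key: &str, bytes: Vec<u8>) {
        self.objects.insert(key.to_string(), bytes);
    }

    /// GetObject: the bytes stored under a key, if any.
    fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }

    /// CopyObject: copy the value of `from` to `to`; false if `from` is missing.
    fn copy(&mut self, from: &str, to: &str) -> bool {
        match self.objects.get(from).cloned() {
            Some(bytes) => {
                self.objects.insert(to.to_string(), bytes);
                true
            }
            None => false,
        }
    }

    /// ListObjects: all keys starting with `prefix`, in sorted key order.
    fn list(&self, prefix: &str) -> Vec<&str> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .map(|k| k.as_str())
            .collect()
    }

    /// DeleteObject: remove a key; true if it existed.
    fn delete(&mut self, key: &str) -> bool {
        self.objects.remove(key).is_some()
    }
}
```

Note that `list("fo")` on a store holding `foo/a.txt` and `foo/b.txt` simply returns both keys; the `/` has no special meaning to the store itself.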
   
   Trying to make object storage behave exactly like a filesystem is impossible (e.g. S3 doesn't support CreateIfNotExists), however, my thesis is that no query engine actually wants filesystem semantics, and this is why these linked abstractions **kind of** work (https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800).
   
   My suggestion is that by instead implementing the less expressive object storage semantics, we can avoid a whole host of funky edge-cases around directories, paths, etc...
   
   > in a way compatible with other systems that may use a FileSystem approach
   
   Could you expand on what you mean by this, do you mean being able to read data written by another system which should be trivial, or are you talking about some sort of API-level integration like FFI?



[GitHub] [arrow-datafusion] wjones127 commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

wjones127 commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119699907

   Also relevant:
    * #2246
    * #2185
   
   As part of that PR, I plan on creating a generic suite of tests to validate an `ObjectStore` implementation, and that could enforce these behavior expectations for each implementation.
   
   For FileSystem vs ObjectStore, I'm only familiar with implementations of the first in the context of query engines (such as [Arrow C++'s FileSystem](https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/filesystem.h) or Python's [fsspec](https://filesystem-spec.readthedocs.io/en/latest/)). Are there examples of ObjectStore implementations?
   
   My preference is for a "FileSystem" approach since that's more familiar, but I'm open to the ObjectStore approach as long as it can be used to read and write in a way compatible with other systems that may use a FileSystem approach.



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120202658

   Apologies for going back and forth; next time I'll take the time to iterate over my out-loud thinking and only post a coherent answer...
   
   Last realisation: by having an ObjectStore that can only filter/scan on a prefix, we take away the possibility for object stores to optimise eventual suffix filters (predicate pushdown for file searching, if you will).



[GitHub] [arrow-datafusion] alamb commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119781793

   > Are there examples of ObjectStore implementations?
   
   The canonical example of `ObjectStore` is AWS's S3: https://aws.amazon.com/s3/ and then there are many distributed storage systems that present a similar interface, as @tustvold  describes in https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119739041
   
   The idea of the "ObjectStore" interface in DataFusion was to provide API access to the lowest common denominator feature set across several storage implementations. For example, here are three implementations for S3, HDFS, and Azure specifically: 
   * https://github.com/datafusion-contrib/datafusion-objectstore-s3
   * https://github.com/datafusion-contrib/datafusion-objectstore-hdfs
   * https://github.com/datafusion-contrib/datafusion-objectstore-azure
   
   In terms of "glob"ing, that is typically not a feature provided by object stores (e.g. there is no such thing in S3, which instead offers a much more restricted notion of `prefix`es). Thus, it seems to me if we want to support globbing for DataFusion when running on local files, it will have to be a special case somehow.
   
   You can see another example of a Rust API to object storage in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store
   



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120176920

   Currently the globbing implementation in DataFusion is somewhat blurry, because it tries to work around a limitation of the `LocalFileSystem` ObjectStore implementation.
   
   As we all seem to agree, the proper solution would be to fix the `LocalFileSystem` implementation so that it does not error on a prefix that does not represent a file/directory.



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119987284

   Another thing to consider is that we probably need the capability to ignore some keys
   
   (eg: some apps such as spark, when badly configured, generate files such as /xxx/_temp/_SUCCESS).
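
A minimal sketch of such a filter follows. The rule used here (skip any key with a `/`-separated segment starting with `_` or `.`) is an assumption borrowed from the common Hadoop/Spark convention for hidden files, and `is_hidden_key` / `visible_keys` are hypothetical names.

```rust
/// True if any `/`-separated segment of the key starts with `_` or `.`.
/// Assumed convention: such segments mark temporary or marker objects.
fn is_hidden_key(key: &str) -> bool {
    key.split('/')
        .any(|segment| segment.starts_with('_') || segment.starts_with('.'))
}

/// Drop hidden keys from a listing, keeping the original order.
fn visible_keys<'a>(keys: &[&'a str]) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| !is_hidden_key(k)).collect()
}
```

Because the check is applied after listing, it works the same way regardless of which store produced the keys.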



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119946378

   ListObjects - list the keys with a given prefix
   => Other key-value stores support globs (and far more complex filters):
   https://redis.io/commands/keys/
   https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.KeyConditionExpressions
   ...



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1126884341

   > @tustvold thank you very much for driving these efforts. I apologize I have not been able to contribute much to the conversation or code on these. Based on my current capacity I will likely be limited in what I can contribute on most of these in the foreseeable future - the one exception being #2206 which would actually be very helpful on my side. Perhaps I could work with @timvw to get a first cut of this created in `datafusion-contrib` / published to crates.
   
   Getting there (still want to test some things and change some signatures, e.g. return `Vec<Result>` instead of `Result` when adding multiple tables at once) -> [https://github.com/timvw/datafusion-catalogprovider-glue](https://github.com/timvw/datafusion-catalogprovider-glue)



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119940868

   In case ETL is less important, we should update the project description:
   
   DataFusion is used to create modern, fast and efficient data pipelines, **ETL processes**, and database systems, which need the performance of Rust and Apache Arrow and want to provide their users the convenience of an SQL interface or a DataFrame API.



[GitHub] [arrow-datafusion] alamb commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119784986

   Also, @carols10cents  spent considerable time sorting out consistent directory semantics for object stores and local files in https://github.com/influxdata/influxdb_iox/blob/main/object_store -- maybe we can just use those semantics (or maybe even the code?)



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119309247

   Also consider [Issue-2465](https://github.com/apache/arrow-datafusion/issues/2465).
   
   The objectstore is requested to list files that match prefix "/Users/blah//".  
   LocalFileSystem returns items such as "/Users/blah/test.txt" .
   
   One could claim that this path does not match the prefix. 
   One could also claim that an object can have multiple keys and that this file has an alternative key "/Users/blah//test.txt" which does match the prefix.
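
One way a local-filesystem adapter could make the two views agree is to normalise the prefix before comparing. The sketch below is only an illustration with a hypothetical helper name, `collapse_slashes`; it would not be appropriate for a store such as S3, where keys are opaque strings and `a//b` and `a/b` are different objects.

```rust
/// Collapse every run of `/` into a single `/`.
/// Assumption: this mirrors how a POSIX filesystem resolves such paths.
fn collapse_slashes(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    let mut prev_was_slash = false;
    for c in path.chars() {
        if c == '/' {
            if !prev_was_slash {
                out.push(c);
            }
            prev_was_slash = true;
        } else {
            out.push(c);
            prev_was_slash = false;
        }
    }
    out
}

/// Prefix match after normalising both sides.
fn matches_normalised_prefix(key: &str, prefix: &str) -> bool {
    collapse_slashes(key).starts_with(collapse_slashes(prefix).as_str())
}
```

With this, "/Users/blah/test.txt" matches the prefix "/Users/blah//" after normalisation, although it does not match as a raw string.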
   



[GitHub] [arrow-datafusion] matthewmturner commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

matthewmturner commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120633174

   @tustvold thank you very much for driving these efforts. I apologize I have not been able to contribute much to the conversation or code on these.  Based on my current capacity I will likely be limited in what I can contribute on most of these in the foreseeable future - the one exception being #2206 which would actually be very helpful on my side.   Perhaps I could work with @timvw to get a first cut of this created in `datafusion-contrib` / published to crates.



[GitHub] [arrow-datafusion] Cheappie commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

Cheappie commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120399875

In my case, the existing design of the ObjectStore interface forced me to re-engineer ListingTable in order to provide yet another way of listing a data source.

From my perspective it might be beneficial to push information about the data source from TableProvider to ObjectStore. Then the ObjectStore for a local file system would combine the data (table) location with the strategy for listing that kind of storage. As a result, the listing methods in ObjectStore could drop the concept of a path as the way to access data.

Then ObjectStore could offer a more generic interface with two methods:
* list(filters)
  * query filters should be available in the ObjectStore list method, to let anyone provide their own predicate pushdown algorithm
* file_reader(sized_file)

Such an interface should allow us to provide any kind of listing approach (dir, glob, etc.). What do you think?

It's not a necessity, but the last component bound to a path is SizedFile. Outside of ObjectStore it could be treated as an abstract blob with characteristics such as `size`, because only ObjectStore should know how to access it via `file_reader`.
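
A rough sketch of what such a two-method interface might look like. Everything here is hypothetical (`ListingStore`, `SizedBlob`, `MemStore`, and the choice of name-based filters); it only illustrates that callers never handle a path, just an opaque handle plus its size.

```rust
/// Opaque handle to a stored blob; callers only see its size.
#[derive(Debug, Clone, PartialEq)]
struct SizedBlob {
    id: usize,
    size: u64,
}

/// Hypothetical two-method store with no paths in its public interface.
trait ListingStore {
    /// List the blobs accepted by every filter.
    fn list(&self, filters: &[&dyn Fn(&str) -> bool]) -> Vec<SizedBlob>;
    /// Read a blob previously returned by `list`.
    fn file_reader(&self, blob: &SizedBlob) -> Option<&[u8]>;
}

/// In-memory stand-in: the names are private to the store.
struct MemStore {
    entries: Vec<(String, Vec<u8>)>,
}

impl ListingStore for MemStore {
    fn list(&self, filters: &[&dyn Fn(&str) -> bool]) -> Vec<SizedBlob> {
        self.entries
            .iter()
            .enumerate()
            .filter(|(_, (name, _))| filters.iter().all(|f| f(name.as_str())))
            .map(|(id, (_, data))| SizedBlob { id, size: data.len() as u64 })
            .collect()
    }

    fn file_reader(&self, blob: &SizedBlob) -> Option<&[u8]> {
        self.entries.get(blob.id).map(|(_, data)| data.as_slice())
    }
}
```

How the store knows the table location (the constructor of `MemStore` here) is exactly the information that would be pushed down from the TableProvider in this design.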


[GitHub] [arrow-datafusion] alamb commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120413298

   > From my perspective It might be beneficial to push information about data source from TableProvider to ObjectStore. Then ObjectStore for a local file system, would combine data(table) location and strategy for listing that kind of storage. As a result listing methods present in ObjectStore could drop the concept of path as a way to access data.
   
   I really like the idea of providing an extensible storage interface that allows APIs such as suggested by @Cheappie  and @timvw. 
   
   Given these APIs seem to be adding semantics to the list of files on ObjectStorage, perhaps we could add an extra layer specifically in the APIs rather than trying to extend `ObjectStore` or adding more logic to `ListingTable`. Perhaps something like the `StorageCatalog` in:
   
   
   ```text
   ┌───────────────────────────────────┐
   │                                   │
   │           ListingTable            │
   │                                   │
   └───────────────────────────────────┘
   ┌───────────────────────────────────┐
   │          StorageCatalog           │
   │  (e.g figure out which files on   │
   │     object store to process)      │
   └───────────────────────────────────┘
   ┌────────────────┐ ┌────────────────┐
   │  ObjectStore   │ │  File Format   │
   │(e.g. S3, HDFS) │ │ (e.g. parquet) │
   │                │ │                │
   └────────────────┘ └────────────────┘
   ```
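
As a rough sketch of that layering, the catalog can be a separate trait that sits on top of a prefix-only store. All names below (`Store`, `StorageCatalog`, `PrefixCatalog`, `ManifestCatalog`, `VecStore`) are hypothetical and not existing DataFusion types.

```rust
/// Data access layer: the only listing primitive is by string prefix.
trait Store {
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// Catalog layer: decides which objects on the store make up a table.
trait StorageCatalog {
    fn files(&self, store: &dyn Store) -> Vec<String>;
}

/// Catalog that takes every key under a prefix that ends with a suffix.
struct PrefixCatalog {
    prefix: String,
    suffix: String,
}

impl StorageCatalog for PrefixCatalog {
    fn files(&self, store: &dyn Store) -> Vec<String> {
        store
            .list(&self.prefix)
            .into_iter()
            .filter(|k| k.ends_with(self.suffix.as_str()))
            .collect()
    }
}

/// Catalog backed by an explicit list of keys, e.g. read from a manifest.
struct ManifestCatalog {
    keys: Vec<String>,
}

impl StorageCatalog for ManifestCatalog {
    fn files(&self, _store: &dyn Store) -> Vec<String> {
        self.keys.clone()
    }
}

/// In-memory stand-in for an object store.
struct VecStore(Vec<String>);

impl Store for VecStore {
    fn list(&self, prefix: &str) -> Vec<String> {
        self.0.iter().filter(|k| k.starts_with(prefix)).cloned().collect()
    }
}
```

Glob, suffix, or manifest-based selection then become different `StorageCatalog` implementations, and the store interface stays at plain prefix listing.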



[GitHub] [arrow-datafusion] wjones127 commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

wjones127 commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119813260

   > If we like the IOx object store interface and want to reuse the implementation, I can probably see about getting it published to crates.io, just let me know.
   
   I would be supportive of that, but we probably would need to discuss what that means for 
   https://github.com/datafusion-contrib/datafusion-objectstore-s3
   https://github.com/datafusion-contrib/datafusion-objectstore-hdfs
   https://github.com/datafusion-contrib/datafusion-objectstore-azure
   
   Do we want to create a new issue to discuss that?



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1118399492

   Regardless of the chosen approach (ObjectStore vs FileSystem), I would consider making the trait (and its methods) consistent:
   
   Currently the trait is named ObjectStore but it only has methods related to files. Either rename the methods (and datatypes), e.g. `fn list_object(s) -> ObjectMetadata`, or rename the trait to FileSystem...



[GitHub] [arrow-datafusion] tustvold commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120419414

   I think it is important to keep a separation between:
   
   * Catalog: what data files are where, what schema they have, what encoding they are, etc...
   * Data Access: how to get the data of a specific file
   
   In particular, there is a very common use case where an additional catalog is used to provide query performance, and by keeping the concerns separate we can ensure this is well supported.
   
   Currently I would view the catalog abstraction as `SchemaProvider`/`TableProvider`, and the data access as `ObjectStore`, but there is definitely potential to extract common catalog logic, and doing so as suggested by @alamb is a good idea :+1:
   
   FWIW I created some tickets a while back on supporting external catalogs (e.g. https://github.com/apache/arrow-datafusion/issues/2206, https://github.com/apache/arrow-datafusion/issues/2208 and https://github.com/apache/arrow-datafusion/issues/2209) which may be relevant here. I also created tickets to make the file operators themselves less coupled with the catalog - https://github.com/apache/arrow-datafusion/issues/2291 and https://github.com/apache/arrow-datafusion/issues/2293.



[GitHub] [arrow-datafusion] alamb commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119783979

   It would help me significantly to understand the globbing use case better -- when exactly are you selecting a subset of files in a directory via a glob? Most analytic systems I have seen tend to assume data has been pre-grouped into directories (or equivalent).
   
   AWS redshift does offer the ability to specify a subset of files that are not all in the same directory, but it does so by taking a manifest file:  https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120114181

   In summary, I agree with the ObjectStore semantics being sufficient.
   
   I also do want to point out that globbing is nothing more than making the suffix filter more powerful (instead of matching against a static suffix (eg: .parquet) it allows matching against a pattern).
   
       /// Calls `list_file` with a suffix filter
       async fn list_file_with_suffix(
           &self,
           prefix: &str,
           suffix: &str,
       ) -> Result<FileMetaStream>
   



[GitHub] [arrow-datafusion] timvw commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

timvw commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1120114668

   The most flexible signature of that method would be:
   
       /// Calls `list_file` with a suffix filter
       async fn list_file_with_suffix(
           &self,
           prefix: &str,
           suffix_filter: fn(&str) -> bool,
       ) -> Result<FileMetaStream>
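
One detail worth noting about that signature: in Rust, `fn(&str) -> bool` is a plain function pointer, so it accepts named functions and non-capturing closures but not closures that capture runtime state (such as a pattern chosen by the user). The sketch below contrasts the two over an in-memory key list; the function names are hypothetical and the synchronous form is a simplification of the async trait method.

```rust
/// Variant taking a plain function pointer, as in the signature above.
fn filter_with_fn_ptr<'a>(keys: &[&'a str], suffix_filter: fn(&str) -> bool) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| suffix_filter(*k)).collect()
}

/// Variant taking any closure, so the filter may capture runtime state.
fn filter_with_closure<'a>(
    keys: &[&'a str],
    suffix_filter: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| suffix_filter(*k)).collect()
}

/// A static filter: usable with either variant.
fn is_parquet(key: &str) -> bool {
    key.ends_with(".parquet")
}
```

A filter such as `|k| k.ends_with(wanted.as_str())`, where `wanted` is a `String` read at runtime, only compiles against the second variant, so a generic or boxed `Fn` may be the more flexible choice if patterns come from user input.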



[GitHub] [arrow-datafusion] wjones127 commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

wjones127 commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1119796664

   > I'm not sure what you mean by this
   
   Sorry that wasn't clear. I pointed out two implementations of an abstraction over object stores (S3, GCS, etc.) that are like filesystems (in that they have a notion of directories, not that they make any guarantees about atomicity). These are used by analytics systems like Dask and PyArrow, so there's some evidence we can build useful query engines on top of such an abstraction.
   
   Thanks @alamb for the IOx example.
   
   > Trying to make object storage behave exactly like a filesystem is impossible (e.g. S3 doesn't support CreateIfNotExists), however, my thesis is that no query engine actually wants filesystem semantics,
   
   I largely agree. I think the main thing these "FileSystem" abstractions provide is a notion of "directory", which is important in directory-partitioned datasets. The existing API can handle that fine with delimiter, but it does seem a little funny you can provide whatever delimiter you want.
   
   > Could you expand on what you mean by this, do you mean being able to read data written by another system which should be trivial, or are you talking about some sort of API-level integration like FFI?
   
   Yeah I think as long as you *could* do the expected filesystem operations on top of the API, then that seems fine. For context, I plan to wrap the `ObjectStore` API in a PyArrow-compatible filesystem for use in delta-rs. Hence #2246.
   
   But I think I'll scale back my changes in #2246 and remove the `create_dir()`, `remove_dir()` methods if we want to just think of this as an object store abstraction with no awareness of directories.
   
   > Also, @carols10cents spent considerable time sorting out consistent directory semantics for object stores and local files in https://github.com/influxdata/influxdb_iox/blob/main/object_store -- maybe we can just use those semantics (or maybe even the code?)
   
   That sounds very promising @alamb. Thanks for pointing that out!



[GitHub] [arrow-datafusion] tustvold commented on issue #2445: ObjectStore Directory Semantics

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #2445:
URL: https://github.com/apache/arrow-datafusion/issues/2445#issuecomment-1130195385

   I'm going to close this as I think it is superseded by #2504.
   
   Thank you all for helping move this forward :+1:

