Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/28 16:30:29 UTC

[GitHub] [arrow-rs] wjones127 opened a new issue, #3970: [object_store] Add option to start listing at a particular key

wjones127 opened a new issue, #3970:
URL: https://github.com/apache/arrow-rs/issues/3970

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   In an object store, we might have a bunch of sequential files being written:
   
   ```
   0000001.json
   0000002.json
   ...
   0001000.json
   ```
   
   We'd like to be able to query for all the "new" files starting at a certain point, skipping all the earlier files.
   
   S3 has a `start-after` parameter we can use for this. https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestParameters
   TBD on other systems.
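   
   Skipping by key only works if the names sort the same way as the sequence numbers. A minimal sketch of why the zero padding matters, assuming the store lists keys in lexicographic order; `file_name` and `unpadded_name` are hypothetical helpers for illustration only:
   
   ```rust
   /// Hypothetical helper: zero-padded name, e.g. 999 -> "0000999.json".
   fn file_name(seq: u64) -> String {
       format!("{:07}.json", seq)
   }
   
   /// Hypothetical helper: the same name without padding, e.g. 999 -> "999.json".
   fn unpadded_name(seq: u64) -> String {
       format!("{}.json", seq)
   }
   
   /// True if comparing the two names as strings agrees with comparing
   /// the sequence numbers they were built from.
   fn order_preserved(a: u64, b: u64, name: fn(u64) -> String) -> bool {
       (a < b) == (name(a) < name(b))
   }
   ```
   
   With padding, `order_preserved(999, 1000, file_name)` holds, so a start key of `0000999.json` skips exactly the older files. Without padding, `1000.json` sorts before `999.json` and the same start key would skip newer files.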
   
   **Describe the solution you'd like**
   
   I'm not sure of the best way to add the parameter. Does it belong in a new method, or should we introduce a more complex "ListCallBuilder" API?
   
   ```rust
   let list_stream = object_store
        .build_list_call(Some(&prefix))
        .with_start_key("0000999.json")
        .await
        .expect("Error listing files");
   ```
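   
   To make the builder idea concrete, here is a minimal, self-contained sketch of what such a type could hold. Everything in it is hypothetical: `ListCallBuilder`, `new`, `with_start_key` and `apply` are illustration names and not part of `object_store`, and the start key is assumed to be an exclusive lower bound. It filters an in-memory key list rather than calling a real store.
   
   ```rust
   /// Hypothetical options holder; not part of the object_store crate.
   #[derive(Debug, Default, Clone, PartialEq)]
   struct ListCallBuilder {
       prefix: Option<String>,
       start_key: Option<String>,
   }
   
   impl ListCallBuilder {
       /// Start a listing request, optionally restricted to a prefix.
       fn new(prefix: Option<&str>) -> Self {
           Self { prefix: prefix.map(str::to_owned), start_key: None }
       }
   
       /// Only return keys that sort strictly after `key` (assumed exclusive).
       fn with_start_key(mut self, key: &str) -> Self {
           self.start_key = Some(key.to_owned());
           self
       }
   
       /// Apply the options to an in-memory key list, standing in for a store.
       fn apply<'a>(&self, keys: &[&'a str]) -> Vec<&'a str> {
           keys.iter()
               .copied()
               .filter(|k| self.prefix.as_deref().map_or(true, |p| k.starts_with(p)))
               .filter(|k| self.start_key.as_deref().map_or(true, |s| *k > s))
               .collect()
       }
   }
   ```
   
   A builder keeps the common case short (`ListCallBuilder::new(None)`) while leaving room for further options later, at the cost of one more type in the public API.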
   
   **Describe alternatives you've considered**
   
   Not sure if there is an easier way to do it.
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3970: [object_store] Add option to start listing at a particular key

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3970:
URL: https://github.com/apache/arrow-rs/issues/3970#issuecomment-1487338525

   This seems like a useful feature. I'll have a think about what an API for this could look like. I wonder if we should just add a `list_opts` method that acts as a superset of `list_with_delimiter` and `list` :thinking: 
   
   What is support for this like in other object stores? I presume they support it, but I've learnt never to assume anything when it comes to object stores :sweat_smile: 




[GitHub] [arrow-rs] rtyler commented on issue #3970: [object_store] Add option to start listing at a particular key

Posted by "rtyler (via GitHub)" <gi...@apache.org>.
rtyler commented on issue #3970:
URL: https://github.com/apache/arrow-rs/issues/3970#issuecomment-1487404789

   Speaking selfishly, supporting S3-based optimizations goes a long way given its dominance in the market. Most data workloads I see are on AWS or GCP, so it's great that you found a compatible API in GCS @wjones127 




[GitHub] [arrow-rs] tustvold closed issue #3970: [object_store] Add option to start listing at a particular key

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3970: [object_store] Add option to start listing at a particular key
URL: https://github.com/apache/arrow-rs/issues/3970




[GitHub] [arrow-rs] tustvold commented on issue #3970: [object_store] Add option to start listing at a particular key

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3970:
URL: https://github.com/apache/arrow-rs/issues/3970#issuecomment-1487725553

   I've created https://github.com/apache/arrow-rs/pull/3973. If we like the interface, I can flesh it out.




[GitHub] [arrow-rs] wjones127 commented on issue #3970: [object_store] Add option to start listing at a particular key

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #3970:
URL: https://github.com/apache/arrow-rs/issues/3970#issuecomment-1487380859

   It looks like support isn't *that* wide:
   
   * [S3](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_RequestParameters): has `start-after` (exclusive?)
   * [GCS](https://cloud.google.com/storage/docs/json_api/v1/objects/list): has a `startOffset` (inclusive) 
   * [Azure Blob Store](https://learn.microsoft.com/en-us/rest/api/storageservices/list-blobs?tabs=azure-ad): not supported.
   * I don't see any obvious API for local filesystems.
   
   So this is mostly providing a useful optimization for S3 and GCS. There can be a default implementation that just throws out earlier entries. Also, for consistency between S3 and GCS, we would have to make the lower bound exclusive, since that seems to be the S3 behavior.
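   
   A minimal sketch of those two pieces, using plain string slices in place of real listing results. Both functions are hypothetical and not part of `object_store`: `list_after` is the fallback that throws out earlier entries, and `inclusive_to_exclusive` shows how an inclusive result (as `startOffset` is described above) could be aligned with an exclusive lower bound.
   
   ```rust
   /// Hypothetical fallback for stores without a native offset parameter:
   /// keep only keys that sort strictly after `offset` (exclusive lower bound).
   fn list_after<'a>(keys: &[&'a str], offset: &str) -> Vec<&'a str> {
       keys.iter().copied().filter(|k| *k > offset).collect()
   }
   
   /// Hypothetical adapter for an inclusive API: drop the entry equal to
   /// `offset` so the result matches the exclusive semantics.
   fn inclusive_to_exclusive<'a>(listed: Vec<&'a str>, offset: &str) -> Vec<&'a str> {
       listed.into_iter().filter(|k| *k != offset).collect()
   }
   ```
   
   The fallback still pays for listing every key under the prefix, so it only gives consistent behavior across stores, not the speedup.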




[GitHub] [arrow-rs] JHibbard commented on issue #3970: [object_store] Add option to start listing at a particular key

Posted by "JHibbard (via GitHub)" <gi...@apache.org>.
JHibbard commented on issue #3970:
URL: https://github.com/apache/arrow-rs/issues/3970#issuecomment-1488010076

   The Azure Data Lake Storage Gen2 REST API has endpoints for [`filesystem list`](https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/filesystem/list) and [`path list`](https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list) that look interesting, but the documentation is vague. ADLS-G2 is hierarchical in nature, so some offset/skipping API is likely available somewhere.
   
   filesystem list:
   - prefix: Filters results to filesystems within the specified prefix.
   
   path list:
   - directory: Filters results to paths within the specified directory. An error occurs if the directory does not exist.

