Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/05/11 10:53:22 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #4199: ObjectStore with_url Should Handle Path

tustvold opened a new issue, #4199:
URL: https://github.com/apache/arrow-rs/issues/4199

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Various builders such as `AmazonS3Builder`, `MicrosoftAzureBuilder`, etc. provide a `with_url` method.
   
   However, with the exception of URL patterns such as `https://s3.region.amazonaws.com/bucket`, which encode the bucket name in the URL path, they ignore the path. This is surprising and inconsistent with stores such as `HttpStore` and `LocalFileSystem`, which have a built-in notion of a prefix.
   
   This can, to a certain extent, be worked around with `PrefixStore`, but doing so correctly requires duplicating the logic that determines which parts of a given URL form the prefix.
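
A minimal sketch of the kind of logic a caller would have to duplicate, using only the Rust standard library. The helper name `split_bucket_and_prefix` is hypothetical, and it handles just the two URL shapes mentioned above; the real builders recognise more patterns, so treat this as an illustration rather than as what `object_store` does internally.

```rust
/// Hypothetical helper: split a URL into (bucket, prefix).
/// Only two shapes are handled, purely for illustration:
///   s3://bucket/some/prefix
///   https://s3.<region>.amazonaws.com/bucket/some/prefix
fn split_bucket_and_prefix(url: &str) -> Option<(String, String)> {
    let (scheme, rest) = url.split_once("://")?;
    // Separate the host from the path, if there is a path at all.
    let (host, path) = match rest.split_once('/') {
        Some((host, path)) => (host, path),
        None => (rest, ""),
    };
    let path = path.trim_matches('/');
    match scheme {
        // The bucket is the host; the whole path is the prefix.
        "s3" if !host.is_empty() => Some((host.to_string(), path.to_string())),
        // Path-style HTTPS: the bucket is the first path segment.
        "https" if host.starts_with("s3.") && host.ends_with(".amazonaws.com") => {
            let (bucket, prefix) = path.split_once('/').unwrap_or((path, ""));
            if bucket.is_empty() {
                None
            } else {
                Some((bucket.to_string(), prefix.to_string()))
            }
        }
        // Anything else is out of scope for this sketch.
        _ => None,
    }
}
```

The point of the sketch is that the same path segment is a prefix under one URL shape and a bucket name under the other, so treating "everything after the host" as the prefix is only correct for some URLs.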
   
   **Describe the solution you'd like**
   
   I would like the cloud stores to have a `with_prefix` option, and to populate this within `with_url`.
   
   **Describe alternatives you've considered**
   
   **Additional context**
   
   Relates to #4047
   
   delta-rs has some logic [here](https://github.com/delta-io/delta-rs/blob/c8371b38fdf22802f0f91b4ddc2a47da6be97c68/rust/src/storage/config.rs#LL198C5-L198C5) to handle this, although this will misbehave for URLs where the bucket name is encoded in the path.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "flokli (via GitHub)" <gi...@apache.org>.
flokli commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1986874316

   I stumbled over this today. While `parse_url` returns the rest of the URL, I need to use the `*Builder` structs so that configuration via environment variables etc. (`from_env`) is also possible.
   
   Getting the path out of a URL is quite messy, and I'd rather not reimplement this on my own. Of course I could use `parse_url` as a "path extractor" and immediately throw away the `Box<dyn ObjectStore>` it also returns, but that's a bit ugly.
   
   I'd much prefer `with_url` in all `*Builder` structs to put the `Path` it found into a field, and to provide a function to return it from there.




Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "flokli (via GitHub)" <gi...@apache.org>.
flokli commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1991241854

   > Perhaps we could make https://github.com/apache/arrow-rs/blob/master/object_store%2Fsrc%2Fparse.rs#L71 public?
   
   That'd help. We might still parse the URL twice, but that's less of a big deal than constructing and throwing away the entire `dyn ObjectStore`.
   > 
   > > But then what's the point of having the more composable cloud-specific builders in the first place?
   > 
   > For the use-cases where stores aren't configured by a URL??
   
   Hmmh, I think in most applications you almost definitely want to have a combination of both. `s3://my-bucket/some-subpath` works as a pretty good identifier to specify the protocol, bucket name and subpath where something is located, so that's something people exchange in a configuration file.
   
   However, "how to get there" usually is very specific to the environment - a local developer on their local machine might have a static access key pair, or some AWS SSO config, while production workloads might use k8s and IAM roles for service accounts (env vars), or IAM roles for EC2 provided by the instance metadata server (implicit).
   
   IMHO, there's a lot of value in `object_store` having the same semantics as the "official" cloud-provider SDKs, even if the users of `object_store` use `parse_url`. That would mean respecting cloud-provider-specific means of configuration (env vars, ambient metadata servers) by default, and at least making the real "path" part of a URL prominently accessible.




Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "flokli (via GitHub)" <gi...@apache.org>.
flokli commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1988817740

   But then what's the point of having the more composable cloud-specific builders in the first place?
   
   They contain the path-parsing logic; being able to extract the leftover path is all that would be needed to use them for this use case.




[GitHub] [arrow-rs] tustvold commented on issue #4199: ObjectStore with_url Should Handle Path

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1544299173

   I actually decided against this, in favor of returning the remaining path from `parse_url` in #4200.




[GitHub] [arrow-rs] tustvold closed issue #4199: ObjectStore with_url Should Handle Path

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4199: ObjectStore with_url Should Handle Path
URL: https://github.com/apache/arrow-rs/issues/4199




Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1992525446

   > Hmmh, I think in most applications you almost definitely want to have a combination of both. s3://my-bucket/some-subpath works as a pretty good identifier to specify the protocol, bucket name and subpath where something is located, so that's something people exchange in a configuration file.
   
   Whilst I agree that this is something people exchange, the challenge comes when people then start creating object stores per path instead of per bucket. This has been a frequent source of throughput issues people have run into.
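
One way to keep path-bearing URLs in configuration while still creating stores per bucket is to cache stores by bucket and carry the path separately. The sketch below uses only the standard library; `BucketStore` and `StoreRegistry` are hypothetical stand-ins, not `object_store` types, and only `s3://bucket/path` URLs are handled.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a real store; a real one would hold
/// client state and credentials for the bucket.
#[derive(Debug)]
struct BucketStore {
    bucket: String,
}

/// Hypothetical registry that hands out one shared store per bucket.
#[derive(Default)]
struct StoreRegistry {
    stores: HashMap<String, Arc<BucketStore>>,
}

impl StoreRegistry {
    /// Resolve `s3://bucket/path` to (shared store, path within the bucket).
    /// Returns `None` for any other URL shape.
    fn resolve(&mut self, url: &str) -> Option<(Arc<BucketStore>, String)> {
        let rest = url.strip_prefix("s3://")?;
        let (bucket, path) = rest.split_once('/').unwrap_or((rest, ""));
        if bucket.is_empty() {
            return None;
        }
        // Reuse the store for this bucket if one was already created.
        let store = self
            .stores
            .entry(bucket.to_string())
            .or_insert_with(|| {
                Arc::new(BucketStore {
                    bucket: bucket.to_string(),
                })
            })
            .clone();
        Some((store, path.to_string()))
    }
}
```

With this shape, resolving `s3://my-bucket/table-a` and `s3://my-bucket/table-b` yields the same `Arc<BucketStore>` and two different paths, so the number of stores tracks the number of buckets rather than the number of URLs.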




Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1986942425

   All `from_env` does is look in the environment for configuration keys; you could do likewise and pass what you found to `parse_url`.
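
A rough sketch of the "do likewise" part, using only the standard library. The helper name `collect_aws_options` is hypothetical, and both the `AWS_` prefix filter and the lowercasing of keys are assumptions about what a caller might want, not a statement of what `from_env` does. How the collected pairs are then handed over depends on the `object_store` version in use and is not shown.

```rust
/// Hypothetical helper: keep only `AWS_*` variables and lowercase their
/// keys, returning the pairs in sorted order so the result is deterministic.
fn collect_aws_options<I>(vars: I) -> Vec<(String, String)>
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut options: Vec<(String, String)> = vars
        .into_iter()
        // Assumption: configuration keys share the `AWS_` prefix.
        .filter(|(key, _)| key.starts_with("AWS_"))
        .map(|(key, value)| (key.to_ascii_lowercase(), value))
        .collect();
    options.sort();
    options
}
```

In an application this would typically be called as `collect_aws_options(std::env::vars())`; taking an iterator instead of reading the environment directly keeps the helper easy to test.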




Re: [I] ObjectStore with_url Should Handle Path [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4199:
URL: https://github.com/apache/arrow-rs/issues/4199#issuecomment-1989168633

   Perhaps we could make https://github.com/apache/arrow-rs/blob/master/object_store%2Fsrc%2Fparse.rs#L71 public?

