Posted to issues@arrow.apache.org by "jpedrick (via GitHub)" <gi...@apache.org> on 2023/05/12 20:05:14 UTC
[GitHub] [arrow] jpedrick opened a new issue, #35575: Cannot call pyarrow.fs.initialize_s3
jpedrick opened a new issue, #35575:
URL: https://github.com/apache/arrow/issues/35575
### Describe the bug, including details regarding any error messages, version, and platform.
Steps to reproduce:
```
>>> import pyarrow.fs
>>> pyarrow.fs.initialize_s3(pyarrow.fs.S3LogLevel.Debug)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/_s3fs.pyx", line 57, in pyarrow._s3fs.initialize_s3
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: S3 was already initialized. It is safe to use but the options passed in this call have been ignored.
```
Version:
```
pip list | grep pyarrow
pyarrow 12.0.0
```
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] amoeba commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1591994110
Thanks @pitrou, I'll make an issue for that task.
[GitHub] [arrow] pitrou commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1590641771
Yes, I think an environment variable is a good idea for this.
[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1589687295
@amoeba it's a reasonable idea. At the moment there are only 2 global options so this seems very doable. I suspect you already know this but there is some precedent with the memory pool configuration here:
https://github.com/apache/arrow/blob/9736dde84bb2e6996d1d12f6a044c33398e3c3a3/cpp/src/arrow/memory_pool.cc#L70
Also, we'd need to update https://arrow.apache.org/docs/cpp/env_vars.html
However, I don't think this solves @jpedrick's original ask:
> This call itself is somewhat slow, so users may want to defer calling it until they actually initialize an S3FileSystem class
[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1548419134
Yes, I noticed this myself when testing some changes. Unfortunately, if we just remove the call from the `pyarrow.fs` import then I think it will break some people. Perhaps we should just do better documentation around the `pyarrow._s3fs` workaround?
However, if the goal is to actually defer S3 initialization until S3 is used then I don't think this helps.
Ideally, we could move the ensure initialized function inside of the `S3FileSystem` constructor but then this wouldn't catch cases where we use S3 indirectly (e.g. from an S3 URI).
We could push the auto-initialization into C++ I suppose.
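To make the trade-off above concrete, here is a minimal pure-Python sketch of deferring initialization until first use. It is a toy model, not pyarrow's implementation: `S3Runtime`, `ToyS3FileSystem`, `toy_filesystem_from_uri`, and `AlreadyInitialized` are hypothetical names, and the `"fatal"` default log level is an assumption. The point is that every entry point that can reach S3 (the constructor and the URI path) calls the same "ensure initialized" hook, so an explicit `initialize` call still works as long as it happens before first use.

```python
import threading


class AlreadyInitialized(Exception):
    """Hypothetical error: explicit options arrived after initialization."""


class S3Runtime:
    """Toy stand-in for process-wide S3 state (not pyarrow's real code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log_level = None  # None means "not initialized yet"

    def initialize(self, log_level):
        # Explicit initialization: only allowed once, before any implicit use.
        with self._lock:
            if self.log_level is not None:
                raise AlreadyInitialized("already initialized; options ignored")
            self.log_level = log_level

    def ensure_initialized(self):
        # Implicit initialization: fall back to a default if nobody set options.
        with self._lock:
            if self.log_level is None:
                self.log_level = "fatal"  # assumed default for this sketch


RUNTIME = S3Runtime()


class ToyS3FileSystem:
    def __init__(self, runtime=RUNTIME):
        # Deferred: initialization happens here, not at module import time.
        runtime.ensure_initialized()
        self.runtime = runtime


def toy_filesystem_from_uri(uri, runtime=RUNTIME):
    # The indirect path (an S3 URI) goes through the same constructor,
    # so it cannot bypass the initialization hook.
    if uri.startswith("s3://"):
        return ToyS3FileSystem(runtime)
    raise ValueError(f"unsupported URI in this sketch: {uri!r}")
```

In this model nothing is initialized at import time, so calling `RUNTIME.initialize("debug")` before constructing a filesystem succeeds, while calling it afterwards raises `AlreadyInitialized`. Pushing the hook into C++, as suggested above, would play the same role as routing every path through one function here.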
[GitHub] [arrow] pitrou commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1561532955
@jorisvandenbossche
[GitHub] [arrow] amoeba commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1586274440
Tangential to improving the docs, we're having a discussion over in #35398 about how customizing S3 initialization should work in R. I initially tried to make R's similar to Python's, but one idea I had was to just add another `ARROW_` env var, like `ARROW_S3_LOG_LEVEL`, and have the C++, Python, and R implementations respect it when they initialize S3. With a helper like `initialize_s3`/`s3_init`, you have to restart the process to effect a change, so making it an environment variable may be less confusing.
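As a rough illustration of the environment-variable idea, the sketch below reads `ARROW_S3_LOG_LEVEL` once and maps it to a log level. Only the variable name comes from this thread. The `LogLevel` enum, the `log_level_from_env` helper, the case-insensitive parsing, and the `FATAL` default are all assumptions made for the example, not an existing Arrow API.

```python
import enum
import os


class LogLevel(enum.Enum):
    """Hypothetical log levels for this sketch (names modeled on S3LogLevel)."""
    OFF = "off"
    FATAL = "fatal"
    ERROR = "error"
    WARN = "warn"
    INFO = "info"
    DEBUG = "debug"
    TRACE = "trace"


def log_level_from_env(environ=os.environ, default=LogLevel.FATAL):
    """Return the log level named by ARROW_S3_LOG_LEVEL, or the default."""
    raw = environ.get("ARROW_S3_LOG_LEVEL")
    if raw is None or raw.strip() == "":
        # Variable unset or empty: keep the default behaviour.
        return default
    try:
        # Assumed convention: values are matched case-insensitively.
        return LogLevel(raw.strip().lower())
    except ValueError:
        raise ValueError(f"invalid ARROW_S3_LOG_LEVEL: {raw!r}") from None
```

A caller would set the variable before starting the process, for example `ARROW_S3_LOG_LEVEL=debug python script.py`, and the initialization code would call `log_level_from_env()` once. That sidesteps the ordering problem with `initialize_s3`, since there is no second call whose options could be ignored.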
[GitHub] [arrow] dalbani commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "dalbani (via GitHub)" <gi...@apache.org>.
dalbani commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1575216192
> Perhaps we should just do better documentation around the pyarrow._s3fs workaround?
Documenting it would already be very useful.
Unless I had found it here, I don't think I would have thought of the "underscore trick".
[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1585146486
Would you like to propose a PR with some suggested wording?