
Posted to issues@arrow.apache.org by "jpedrick (via GitHub)" <gi...@apache.org> on 2023/05/12 20:05:14 UTC

[GitHub] [arrow] jpedrick opened a new issue, #35575: Cannot call pyarrow.fs.initialize_s3

jpedrick opened a new issue, #35575:
URL: https://github.com/apache/arrow/issues/35575

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Steps to reproduce:
   ```
   >>> import pyarrow.fs
   >>> pyarrow.fs.initialize_s3(pyarrow.fs.S3LogLevel.Debug)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/_s3fs.pyx", line 57, in pyarrow._s3fs.initialize_s3
     File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: S3 was already initialized.  It is safe to use but the options passed in this call have been ignored.
   ```
   
   Version:
   ```
   pip list | grep pyarrow
   pyarrow                  12.0.0
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] amoeba commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "amoeba (via GitHub)" <gi...@apache.org>.

amoeba commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1591994110

   Thanks @pitrou, I'll make an issue for that task.



[GitHub] [arrow] pitrou commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1590641771

   Yes, I think an environment variable is a good idea for this.



[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1589687295

   @amoeba it's a reasonable idea. At the moment there are only 2 global options so this seems very doable.  I suspect you already know this but there is some precedent with the memory pool configuration here:
   
   https://github.com/apache/arrow/blob/9736dde84bb2e6996d1d12f6a044c33398e3c3a3/cpp/src/arrow/memory_pool.cc#L70
   
   Also, we'd need to update https://arrow.apache.org/docs/cpp/env_vars.html
   
   However, I don't think this solves @jpedrick's original ask:
   
   > This call itself is somewhat slow, so users may want to defer calling it until they actually initialize an S3FileSystem class



[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1548419134

   Yes, I noticed this myself when testing some changes.  Unfortunately, if we just remove the call from the `pyarrow.fs` import then I think it will break some people.  Perhaps we should just do better documentation around the `pyarrow._s3fs` workaround?
   
   However, if the goal is to actually defer S3 initialization until S3 is used then I don't think this helps.
   
   Ideally, we could move the ensure initialized function inside of the `S3FileSystem` constructor but then this wouldn't catch cases where we use S3 indirectly (e.g. from an S3 URI).
   
   We could push the auto-initialization into C++ I suppose.
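
As a rough illustration of the trade-off described above, here is a pure-Python model (not pyarrow code) of moving an ensure-initialized step into a constructor. The names `S3Runtime`, `FakeS3FileSystem`, and `filesystem_from_uri`, as well as the `"Fatal"` default, are hypothetical and exist only for this sketch. It shows that every entry point, including the URI path, has to reach the same ensure step, which is presumably why pushing the step down a layer is attractive.

```python
import threading


class S3Runtime:
    """Hypothetical stand-in for process-wide S3 state (not a pyarrow class)."""

    _lock = threading.Lock()
    _initialized = False
    log_level = None

    @classmethod
    def initialize(cls, log_level):
        # Explicit initialization: only valid before the first use.
        with cls._lock:
            if cls._initialized:
                raise RuntimeError("S3 was already initialized")
            cls._initialized = True
            cls.log_level = log_level

    @classmethod
    def ensure_initialized(cls, default_level="Fatal"):
        # Lazy initialization: a no-op when S3 is already initialized.
        with cls._lock:
            if not cls._initialized:
                cls._initialized = True
                cls.log_level = default_level


class FakeS3FileSystem:
    """Hypothetical filesystem whose constructor triggers lazy initialization."""

    def __init__(self):
        S3Runtime.ensure_initialized()


def filesystem_from_uri(uri):
    # The indirect path: it must end up in the same ensure step,
    # here by going through the constructor.
    if uri.startswith("s3://"):
        return FakeS3FileSystem()
    raise ValueError(f"unsupported URI: {uri!r}")
```

With this shape, a caller can still pick options first, for example `S3Runtime.initialize("Debug")` followed by `filesystem_from_uri("s3://bucket/key")`, and nothing S3-related runs at import time.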



[GitHub] [arrow] pitrou commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1561532955

   @jorisvandenbossche 



[GitHub] [arrow] amoeba commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "amoeba (via GitHub)" <gi...@apache.org>.

amoeba commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1586274440

   Tangential to improving the docs, we're having a discussion over in #35398 about how customizing S3 initialization should work in R. I initially tried to make R's similar to Python's, but one idea I had was to just add another `ARROW_` env var, like `ARROW_S3_LOG_LEVEL`, and have the C++, Python, and R implementations respect it when they initialize S3. With a helper like `initialize_s3`/`s3_init`, you have to restart the process to effect a change, so making it an environment variable may be less confusing.
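
A minimal sketch of how such a variable might be read, assuming the `ARROW_S3_LOG_LEVEL` name proposed above. The helper `s3_log_level_from_env`, the case-insensitive matching, and the `"Fatal"` fallback are assumptions of this sketch rather than behavior taken from Arrow; the level names mirror the members of `pyarrow.fs.S3LogLevel`.

```python
import os

# Level names mirroring the members of pyarrow.fs.S3LogLevel.
_LEVELS = ("Off", "Fatal", "Error", "Warn", "Info", "Debug", "Trace")


def s3_log_level_from_env(environ=None, default="Fatal"):
    """Return the S3 log level name requested via ARROW_S3_LOG_LEVEL.

    Hypothetical helper: falls back to `default` when the variable is
    unset or empty, and rejects unknown values instead of guessing.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("ARROW_S3_LOG_LEVEL", "").strip()
    if not raw:
        return default
    for name in _LEVELS:
        if name.lower() == raw.lower():
            return name
    raise ValueError(f"unknown ARROW_S3_LOG_LEVEL value: {raw!r}")
```

Because the value is read once at initialization, changing it still requires a new process, but the user no longer has to call anything before the first import.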



[GitHub] [arrow] dalbani commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "dalbani (via GitHub)" <gi...@apache.org>.

dalbani commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1575216192

   > Perhaps we should just do better documentation around the pyarrow._s3fs workaround?
   
   Documenting it would already be very useful.
   If I hadn't found it here, I don't think I would have thought of the "underscore trick".
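
For readers landing here, the "underscore trick" appears to amount to initializing through the private `pyarrow._s3fs` module before `pyarrow.fs` is ever imported. The traceback at the top of the thread confirms that `initialize_s3` lives in `pyarrow._s3fs`; the helper name `init_s3_early` and the import-order check are assumptions of this sketch, and a private module may change between releases.

```python
import importlib
import sys


def init_s3_early(level_name="Debug"):
    """Try to initialize S3 with a custom log level before pyarrow.fs loads.

    Hypothetical helper. Returns True if initialize_s3 was called, False if
    it was skipped (pyarrow.fs already imported, or S3 support unavailable).
    """
    if "pyarrow.fs" in sys.modules:
        # Too late: importing pyarrow.fs already initialized S3, so a second
        # call would raise "S3 was already initialized".
        return False
    try:
        # Private module: importing it does not run the pyarrow.fs import hook.
        s3 = importlib.import_module("pyarrow._s3fs")
    except ImportError:
        # pyarrow is missing or was built without S3 support.
        return False
    s3.initialize_s3(getattr(s3.S3LogLevel, level_name))
    return True
```

The call has to come first in the program, for example `init_s3_early("Debug")` at the top of the entry script, followed by the usual `import pyarrow.fs`.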



[GitHub] [arrow] westonpace commented on issue #35575: [Python] Cannot call pyarrow.fs.initialize_s3

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35575:
URL: https://github.com/apache/arrow/issues/35575#issuecomment-1585146486

   Would you like to propose a PR with some suggested wording?

