Posted to issues@arrow.apache.org by "jbrockmendel (via GitHub)" <gi...@apache.org> on 2023/04/12 18:53:51 UTC

[GitHub] [arrow] jbrockmendel opened a new issue, #35081: REF: avoid using pandas internals

jbrockmendel opened a new issue, #35081:
URL: https://github.com/apache/arrow/issues/35081

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   At the moment, pyarrow passes a BlockManager to the pd.DataFrame constructor, but in doing so it accesses pandas functions and classes that are not public and that some pandas maintainers (me) would like to wean arrow off of (xref https://github.com/pandas-dev/pandas/pull/52419).
   
   @jorisvandenbossche tells me the current usage is performance-motivated, particularly (but not exclusively?) by the performance hit from pandas silently consolidating blocks, which it no longer does in 2.0.
   
   Let's see if we can find an alternative using pandas' public API.  Starting bid: is pd.DataFrame.from_arrays a viable alternative?
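
To make the question concrete, here is a minimal sketch of building a frame from column arrays through the public `pd.DataFrame` constructor only. It does not assume that a `from_arrays` constructor exists; `frame_from_arrays` is a hypothetical helper name used for illustration, and this is not what pyarrow does. Whether this path avoids copies or matches the performance of the BlockManager path depends on the pandas version and dtypes and would need benchmarking.

```python
import numpy as np
import pandas as pd


def frame_from_arrays(arrays, names, index=None):
    """Hypothetical helper: build a DataFrame from column arrays via public API.

    Note: going through a dict means duplicate column names are not supported.
    """
    # dict preserves insertion order, so the column order follows `names`.
    columns = dict(zip(names, arrays))
    # copy=False asks pandas to avoid copying where it can; it is a request,
    # not a guarantee.
    return pd.DataFrame(columns, index=index, copy=False)


df = frame_from_arrays(
    [np.array([1, 2, 3]), np.array([0.5, 1.5, 2.5])],
    ["a", "b"],
)
print(df.dtypes)
```

The dict-based approach is the simplest public entry point, but it cannot represent duplicate column names, which is one reason it may not be a drop-in replacement.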
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] avoid using pandas internals [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35081:
URL: https://github.com/apache/arrow/issues/35081#issuecomment-1770237575

   @graingert yes, although you shouldn't see that as an end user; that is being fixed on the pandas side (-> https://github.com/pandas-dev/pandas/pull/52419#issuecomment-1770215326)
   
   You will still see the warning when all warnings are shown or turned into errors (for example, when running tests), so we have to switch to a different API in pyarrow.
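
As an illustration of why the warning's visibility depends on the active filter, here is a small standard-library sketch. `legacy_constructor` is a hypothetical stand-in for a library call that emits a `DeprecationWarning`; it is not the pandas or pyarrow code, and the message is shortened.

```python
import warnings


def legacy_constructor():
    # Hypothetical stand-in for a library call that emits a DeprecationWarning.
    warnings.warn(
        "Passing a BlockManager to DataFrame is deprecated",
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def call_with_filter(action):
    """Call legacy_constructor with a single warnings filter action applied."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        try:
            result = legacy_constructor()
        except DeprecationWarning as exc:
            # With the "error" action the warning is raised as an exception.
            return ("raised", str(exc))
    return (result, [str(w.message) for w in caught])


print(call_with_filter("ignore"))  # warning suppressed
print(call_with_filter("always"))  # warning recorded
print(call_with_filter("error"))   # warning raised as an exception
```

Test runners are often configured with the equivalent of the `"always"` or `"error"` actions, which is how a warning that stays hidden in normal use can surface or fail a test suite.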




Re: [I] [Python] avoid using pandas internals [arrow]

Posted by "graingert (via GitHub)" <gi...@apache.org>.
graingert commented on issue #35081:
URL: https://github.com/apache/arrow/issues/35081#issuecomment-1768445485

   pandas now raises `DeprecationWarning: Passing a BlockManager to DataFrame is deprecated and will raise in a future version. Use public APIs instead.` for this usage:
   
   ```python
   import pyarrow.dataset
   import fsspec
   
   paths = [
       "https://github.com/Parquet/parquet-compatibility/raw/master/parquet-testdata/impala/1.1.1-NONE/nation.impala.parquet"
   ]
   (
       pyarrow.dataset.dataset(paths, filesystem=fsspec.filesystem("http"))
       .schema.empty_table()
       .to_pandas()
   )
   ```
   
   




Re: [I] [Python] avoid using pandas internals [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35081:
URL: https://github.com/apache/arrow/issues/35081#issuecomment-2058474927

   Issue resolved by pull request 40897
   https://github.com/apache/arrow/pull/40897




Re: [I] [Python] avoid using pandas internals [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #35081: [Python] avoid using pandas internals
URL: https://github.com/apache/arrow/issues/35081

