Posted to issues@arrow.apache.org by "thomasjpfan (via GitHub)" <gi...@apache.org> on 2023/04/04 15:14:58 UTC

[GitHub] [arrow] thomasjpfan opened a new issue, #34886: `np.asarray(parrow_table)` returns a transposed representation of the data

thomasjpfan opened a new issue, #34886:
URL: https://github.com/apache/arrow/issues/34886

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Running `np.asarray` on a PyArrow Table returns the transpose of the data:
   
   ```python
   import pyarrow as pa
   import pandas as pd
   import numpy as np
   
   df = pd.DataFrame({'year': [2020, 2022, 2019, 2021],
                      'n_legs': [2, 4, 5, 100]})
   pa_table = pa.Table.from_pandas(df)
   
   # Converting to pandas first gives the expected result:
   print(np.asarray(pa_table.to_pandas()))
   # [[2020    2]
   #  [2022    4]
   #  [2019    5]
   #  [2021  100]]
   
   # Calling `np.asarray` directly gives the transpose:
   print(np.asarray(pa_table))
   # [[2020 2022 2019 2021]
   #  [   2    4    5  100]]
   ```
   
   I expect `np.asarray(pa_table)` to give the same result as `np.asarray(pa_table.to_pandas())`.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #34886: `np.asarray(parrow_table)` returns a transposed representation of the data

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #34886:
URL: https://github.com/apache/arrow/issues/34886#issuecomment-1500588323

   I would still call it a bug (if it works, i.e. it returns something, it shouldn't transpose the data), but I think it is indeed caused by the fact that we only implemented numpy compatibility at the array level, as Dane mentioned.
   
   When calling `np.asarray(..)` on a pyarrow Table, numpy sees an object that doesn't have any of the protocol methods like `__array__`, but it does see an iterable object with `__getitem__`, and so it will try to convert it to an array like any list-like. Illustrating this by converting to a list:
   
   ```
   In [2]: table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6]})
   
   In [3]: list(table)
   Out[3]: 
   [<pyarrow.lib.ChunkedArray object at 0x7fb21b832e30>
    [
      [
        1,
        2,
        3
      ]
    ],
    <pyarrow.lib.ChunkedArray object at 0x7fb21b8328e0>
    [
      [
        4,
        5,
        6
      ]
    ]]
   ```
   
   So here we get a list of the columns, each being a ChunkedArray. But because those arrays do have numpy compatibility through `__array__`, numpy unpacks them further: instead of creating a 1D array of column objects, it creates a 2D array, with the number of columns (how the table got unpacked initially) as the first dimension. This gives the "transposed" result compared to what you would expect.
   
   Leaving this as is doesn't sound like a good idea, given the unexpected shape. Two options I can think of:
   
   * Explicitly disallow conversion to numpy (I suppose we could raise an error in `__array__`, although we would have to check that numpy doesn't still fall back to the current method then), and leave it to users to do this themselves (or go through another library that does it).
   * Actually implement `Table.__array__`. 
   
   A simple implementation (for us or for external users) could be `np.stack([np.asarray(col) for col in table], axis=1)`:
   
   ```
   In [14]: np.stack([np.asarray(col) for col in table], axis=1)
   Out[14]: 
   array([[1, 4],
          [2, 5],
          [3, 6]])
   ```
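
For external users, the same `np.stack` idea could be packaged behind `__array__` on a small wrapper. This is only a sketch under assumptions: `RowMajorView` is a hypothetical name, not a pyarrow class, and it is exercised here with plain lists. It relies only on the input being an iterable of equal-length 1D columns that `np.asarray` can convert, which is what iterating a Table yields according to the `list(table)` output above.

```python
import numpy as np


class RowMajorView:
    """Hypothetical wrapper (not part of pyarrow) exposing rows to numpy."""

    def __init__(self, columns):
        # `columns`: any iterable of equal-length 1D array-likes.
        self._columns = list(columns)

    def __array__(self, dtype=None, copy=None):
        # Stack the columns side by side: shape (n_rows, n_columns).
        stacked = np.stack([np.asarray(col) for col in self._columns], axis=1)
        return stacked if dtype is None else stacked.astype(dtype)


view = RowMajorView([[1, 2, 3], [4, 5, 6]])
rows = np.asarray(view)
print(rows.shape)  # (3, 2)
```

Because the wrapper defines `__array__` and no sequence methods, numpy uses the protocol method directly instead of iterating over columns.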
   
   I don't know whether that would start to fail in more complex cases, though. It does seem that if the dtypes are not compatible, `np.stack` gives you object dtype instead of raising an error.
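
The object-dtype behaviour can be checked with numpy alone. The sketch below assumes that a string column arrives as an object-dtype array after `np.asarray(col)`; the arrays are made up for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
# Assumption: a string column is converted to an object-dtype array.
strings = np.array(["a", "b", "c"], dtype=object)

# Stacking an integer column with an object column does not raise;
# the result is promoted to object dtype.
mixed = np.stack([ints, strings], axis=1)
print(mixed.dtype)  # object
```

An object-dtype result keeps the values but loses vectorized numeric operations, so a real implementation might prefer to raise in this case.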
   



[GitHub] [arrow] danepitkin commented on issue #34886: `np.asarray(parrow_table)` returns a transposed representation of the data

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34886:
URL: https://github.com/apache/arrow/issues/34886#issuecomment-1500604313

   +1 Thanks for the correction and the detailed examples @jorisvandenbossche! I agree we can call this a bug. 



[GitHub] [arrow] danepitkin commented on issue #34886: `np.asarray(parrow_table)` returns a transposed representation of the data

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.

danepitkin commented on issue #34886:
URL: https://github.com/apache/arrow/issues/34886#issuecomment-1500494106

   Hi @thomasjpfan,
   
   This isn't a bug, but a difference in the underlying storage layout of the objects (and the limitations of that). 
   
   Arrow supports interoperability with numpy at the array level (https://arrow.apache.org/docs/python/numpy.html). What you are seeing is the zero-copy conversion of the arrow columnar storage format into numpy arrays for each column (https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions). If you don't want to view the data in this format, a copy of the data needs to be made. This is inefficient and usually not the desired behavior. You'll need to implement the copying outside of pyarrow if you want this without using pandas. 
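
Why a row-oriented 2D result needs a copy can be illustrated with numpy alone. In this sketch the two arrays merely stand in for columns held in separate buffers; no Arrow memory is involved.

```python
import numpy as np

# Two columns, each in its own buffer (stand-ins for columnar storage).
col_a = np.array([1, 2, 3])
col_b = np.array([4, 5, 6])

# Converting something that is already an ndarray needs no copy.
assert np.asarray(col_a) is col_a

# An (n_rows, n_columns) array interleaves both columns in one buffer,
# so np.stack has to allocate new memory and copy the values into it.
rows = np.stack([col_a, col_b], axis=1)
print(np.shares_memory(rows, col_a))  # False
```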
   
   For more complex datatypes (e.g. dataframes), you'll need to use pyarrow's pandas interoperability like in your example (https://arrow.apache.org/docs/python/pandas.html#pandas-integration).



[GitHub] [arrow] jorisvandenbossche closed issue #34886: [Python] `np.asarray(parrow_table)` returns a transposed representation of the data

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche closed issue #34886: [Python] `np.asarray(parrow_table)` returns a transposed representation of the data
URL: https://github.com/apache/arrow/issues/34886

