Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2020/09/29 10:57:00 UTC

[jira] [Assigned] (ARROW-6607) [Python] Support for set/list columns when converting from Pandas

     [ https://issues.apache.org/jira/browse/ARROW-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs reassigned ARROW-6607:
--------------------------------------

    Assignee: Krisztian Szucs

> [Python] Support for set/list columns when converting from Pandas
> -----------------------------------------------------------------
>
>                 Key: ARROW-6607
>                 URL: https://issues.apache.org/jira/browse/ARROW-6607
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>         Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10
>            Reporter: Giora Simchoni
>            Assignee: Krisztian Szucs
>            Priority: Major
>             Fix For: 2.0.0
>
>
> Hi,
> Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), set([3,4,5])]})
> df.to_feather('test.ft')
> ```
> I get:
> ```
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2131, in to_feather
>  to_feather(self, fname)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
>  table = Table.from_pandas(df, preserve_index=False)
>  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 496, in dataframe_to_arrays
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 496, in <listcomp>
>  for c, f in zip(columns_to_convert, convert_fields)]
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
>  raise e
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
>  result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
>  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
>  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')
> ```
> And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.
> Questions:
> (1) Is it possible to support these kinds of set/list columns?
> (2) Does anyone have an idea on how to deal with this? I *cannot* unnest these set/list columns as this would explode the DataFrame. My only other idea is to convert the set `{1,2}` into a string `1,2` and parse it after reading the file, hoping it won't be slow.
>  
> Update:
> With lists column the error is different:
> ```python
> import pandas as pd
> df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
> df.to_feather('test.ft')
> ```
> ```
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2131, in to_feather
>  to_feather(self, fname)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
>  feather.write_feather(df, path)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
>  writer.write(df)
>  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 97, in write
>  self.writer.write_array(name, col.data.chunk(0))
>  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
>  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: list<item: int64>
> ```
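
On question (2) in the quoted description, the string round-trip idea could look roughly like the sketch below. It is only an illustration of that workaround, not what the eventual fix does: the helper names `encode_set` and `decode_set` are hypothetical, and the code assumes every element is an integer.

```python
def encode_set(values):
    """Serialize a set of ints as a comma-separated string, e.g. {1, 2} -> '1,2'."""
    # Sorting makes the output deterministic, since sets are unordered.
    return ",".join(str(v) for v in sorted(values))


def decode_set(text):
    """Parse a string produced by encode_set back into a set of ints."""
    # An empty string stands for an empty set.
    return {int(part) for part in text.split(",")} if text else set()


# Same values as column 'b' in the report (hypothetical stand-in for the column).
column_b = [{1, 2}, {2, 3}, {3, 4, 5}]
encoded = [encode_set(s) for s in column_b]
decoded = [decode_set(t) for t in encoded]
```

With pandas, the same helpers could presumably be applied as `df['b'] = df['b'].map(encode_set)` before `to_feather` and `df['b'].map(decode_set)` after reading the file back. Whether this is fast enough depends on the size of the column.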



--
This message was sent by Atlassian Jira
(v8.3.4#803005)