Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/17 17:26:54 UTC

[GitHub] [arrow] crusaderky opened a new issue, #33727: pandas string[pyarrow] -> category -> to_parquet fails

crusaderky opened a new issue, #33727:
URL: https://github.com/apache/arrow/issues/33727

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   pandas 1.5.2
   pyarrow 10.0.1
   
   If you convert a pandas Series with dtype `string[pyarrow]` to `category`, the categories will be `string[pyarrow]`. So far, so good.
   However, when you try writing the resulting object to parquet, PyArrow fails as it does not recognize its own datatype.
   
   ## Reproducer
   ```python
   >>> import pandas as pd
   >>> df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   >>> df.dtypes.x
   string[pyarrow]
   >>> df = df.astype("category")
   >>> df.dtypes.x
   CategoricalDtype(categories=['bar', 'foo'], ordered=False)
   >>> df.dtypes.x.categories.dtype
   string[pyarrow]
   >>> df.to_parquet("foo.parquet")
   pyarrow.lib.ArrowInvalid: ("Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column x with type category')
   ```
   ## Workaround
   ```python
   df = df.astype(
       {
           k: pd.CategoricalDtype(v.categories.astype(object))
           for k, v in df.dtypes.items()
           if isinstance(v, pd.CategoricalDtype)
           and v.categories.dtype == "string[pyarrow]"
       }
   )
   ```
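
The workaround above can also be wrapped in a small helper. This is only a sketch: `categories_to_object` is a hypothetical name, not a pandas or pyarrow function, and it assumes that checking for any `pd.StringDtype` is an acceptable stand-in for the `== "string[pyarrow]"` comparison. Unlike the snippet above, it also carries over the `ordered` flag.

```python
import pandas as pd


def categories_to_object(df):
    """Return a copy of df whose string-typed categories are cast to object.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    fixes = {
        name: pd.CategoricalDtype(
            dtype.categories.astype(object), ordered=dtype.ordered
        )
        for name, dtype in df.dtypes.items()
        # Only touch categorical columns whose categories use a string extension dtype.
        if isinstance(dtype, pd.CategoricalDtype)
        and isinstance(dtype.categories.dtype, pd.StringDtype)
    }
    return df.astype(fixes)
```

Calling `categories_to_object(df).to_parquet("foo.parquet")` should then take the object-categories path that already works.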
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche closed issue #33727: [Python] array() errors if pandas categorical column has dictionary as string not object

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #33727: [Python] array() errors if pandas categorical column has dictionary as string not object
URL: https://github.com/apache/arrow/issues/33727




[GitHub] [arrow] AlenkaF commented on issue #33727: pandas string[pyarrow] -> category -> to_parquet fails

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1386699394

   Thank you for reporting @crusaderky!
   It seems the `array()` method can't handle categorical pandas columns if the dictionary is of `string` type.
   
   The error is triggered in `pandas_compat.py`
   https://github.com/apache/arrow/blob/f769f6b32373fcf5fc2a7a51152b375127ca4af7/python/pyarrow/pandas_compat.py#L591-L598
   
   due to the `array()` method raising `ArrowInvalid`:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   
   # Works with string series/column
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   pa.array(df["x"])
   # <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
   # [
   #   [
   #     "foo",
   #     "bar",
   #     "foo"
   #   ]
   # ]
   
   # Works with categorical with dictionary as object type
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
   df = df.astype("category")
   pa.array(df["x"])
   # <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>
   
   # -- dictionary:
   #   [
   #     "bar",
   #     "foo"
   #   ]
   # -- indices:
   #   [
   #     1,
   #     0,
   #     1
   #   ]
   
   # Errors if dictionary in categorical column is string
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   df = df.astype("category")
   pa.array(df["x"])
   # Traceback (most recent call last):
   #   File "<stdin>", line 1, in <module>
   #   File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
   #     return DictionaryArray.from_arrays(
   #   File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
   #     _dictionary = array(dictionary, memory_pool=memory_pool)
   #   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
   #     result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
   #   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
   #     chunked = GetResultValue(
   #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   #     return check_status(status)
   #   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   #     raise ArrowInvalid(message)
   # pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
   ```
   
   Debugging from `convert_column` in `dataframe_to_arrays` (_pandas_compat.py_):
   ```python
   from pyarrow.pandas_compat import dataframe_to_arrays
   
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
   df = df.astype("category")
   dataframe_to_arrays(df, schema=None, preserve_index=None)
   # > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
   # -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   (Pdb) col
   # 0    foo
   # 1    bar
   # 2    foo
   # Name: x, dtype: category
   # Categories (2, object): ['bar', 'foo']
   
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   df = df.astype("category")
   dataframe_to_arrays(df, schema=None, preserve_index=None)
   # > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
   # -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   (Pdb) col
   # 0    foo
   # 1    bar
   # 2    foo
   # Name: x, dtype: category
   # Categories (2, string): [bar, foo]
   ```
   




[GitHub] [arrow] crusaderky commented on issue #33727: pandas string[pyarrow] -> category -> to_parquet fails

Posted by GitBox <gi...@apache.org>.
crusaderky commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1385776382

   FYI @jrbourbeau @ncclementi




[GitHub] [arrow] jorisvandenbossche commented on issue #33727: [Python] array() errors if pandas categorical column has dictionary as string not object

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1387323624

   Inside `pa.array(..)`, we convert a pandas.Categorical by converting its indices and categories to Arrays, and then call `pa.DictionaryArray.from_arrays`. The first step works:
   
   ```
   In [36]: indices = pa.array(df['x'].cat.codes)
   
   In [37]: df["x"].cat.categories.values
   Out[37]: 
   <ArrowStringArray>
   ['bar', 'foo']
   Length: 2, dtype: string
   
   In [39]: dictionary = pa.array(df["x"].cat.categories.values)
   
   In [40]: dictionary
   Out[40]: 
   <pyarrow.lib.ChunkedArray object at 0x7f9e87f0e7a0>
   [
     [
       "bar",
       "foo"
     ]
   ]
   ```
   
   But the converted categories result in a ChunkedArray, not a plain Array, and then it is `DictionaryArray.from_arrays` that fails: it expects an Array, and if the passed dictionary is not already an Array, it tries to convert it to one:
   
   ```
   In [43]: pa.DictionaryArray.from_arrays(indices, dictionary)
   ...
   ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
   
   In [44]: pa.array(dictionary)
   ...
   ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
   ```
   
   We should probably ensure that `pa.array(..)` returns an Array instead of a ChunkedArray if there is only one chunk (that logic could also live inside pandas' `StringArray.__arrow_array__`).
   I am not sure whether `DictionaryArray.from_arrays` should accept a ChunkedArray (and automatically concatenate the chunks?).
   

