Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/11/14 18:23:00 UTC
[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974511#comment-16974511 ]
Joris Van den Bossche commented on ARROW-7168:
----------------------------------------------
[~buhrmann] thanks for the report. When passing a type like that, I agree it should be honoured.
Some other observations:
Also when it's not all-NaN, the specified type gets ignored:
{code}
In [19]: cat = pd.Categorical(['a', 'b'])
In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False)
In [21]: pa.array(cat, type=typ)
Out[21]:
<pyarrow.lib.DictionaryArray object at 0x7ff87b6a50b8>
-- dictionary:
[
"a",
"b"
]
-- indices:
[
0,
1
]
In [22]: pa.array(cat, type=typ).type
Out[22]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
{code}
So I suppose it's a more general problem, not specifically related to this all-NaN case (it only appears for you in this case, as otherwise the specified type and the type from the data will probably match).
In the example above, we should probably raise an error if the specified type is not compatible with the data (string vs int categories).
> [Python] pa.array() doesn't respect provided dictionary type with all NaNs
> --------------------------------------------------------------------------
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.15.1
> Reporter: Thomas Buhrmann
> Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the possibility of all-NaN batches exists, even when trying to enforce that each batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?