Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/10/08 19:05:00 UTC

[jira] [Created] (ARROW-1658) [Python] Out of bounds dictionary indices causes segfault after converting to pandas

Wes McKinney created ARROW-1658:
-----------------------------------

             Summary: [Python] Out of bounds dictionary indices causes segfault after converting to pandas
                 Key: ARROW-1658
                 URL: https://issues.apache.org/jira/browse/ARROW-1658
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.1
            Reporter: Wes McKinney
             Fix For: 0.8.0


Minimal reproduction:

{code}
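import numpy as np
import pandas as pd
import pyarrow as pa

num = 100
# Indices run 0..99, but the dictionary has a single entry, so every
# index except 0 is out of bounds. The all-False mask marks no slot as null.
arr = pa.DictionaryArray.from_arrays(
    np.arange(0, num),
    np.array(['a'], object),   # builtin object/bool replace the np.object/np.bool aliases
    np.zeros(num, bool),
    True)

# The conversion to pandas is where the segfault occurs
print(arr.to_pandas())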
{code}

Nowhere in the Arrow codebase do we validate that dictionary indices are in bounds, and pandas appears to trust the indices it is given without checking them. We should add a method that validates that the non-null dictionary indices are in bounds (perhaps in {{CategoricalBlock::WriteIndices}}).
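
The kind of bounds check described above could look roughly like the following NumPy sketch. The helper name {{check_dictionary_indices}} and its signature are hypothetical and are not part of pyarrow. It assumes the same mask convention as the reproduction, where True marks a null slot.

```python
import numpy as np


def check_dictionary_indices(indices, mask, dictionary_size):
    """Hypothetical helper: raise IndexError if any non-null index is
    outside the range [0, dictionary_size).

    `mask` uses True to mark a null slot; null slots are not checked.
    """
    indices = np.asarray(indices)
    valid = ~np.asarray(mask, dtype=bool)
    out_of_bounds = valid & ((indices < 0) | (indices >= dictionary_size))
    if out_of_bounds.any():
        # Report the first offending position to make the error actionable.
        first = int(np.flatnonzero(out_of_bounds)[0])
        raise IndexError(
            "dictionary index %d at position %d is out of bounds for a "
            "dictionary of size %d"
            % (int(indices[first]), first, dictionary_size)
        )
```

Called with the arrays from the reproduction ({{np.arange(0, 100)}}, an all-False mask, and a dictionary of size 1), this sketch raises {{IndexError}} rather than handing invalid indices on to pandas.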

As an aside: when doing analytics on categorical data, there may be other cases where external data contains out-of-bounds index values. We should plan for these and decide whether to raise an exception or treat them as null.
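
To make the two options concrete, here is a minimal sketch of a policy switch. The function {{resolve_out_of_bounds}} and its {{on_invalid}} parameter are hypothetical names used only for illustration and do not exist in pyarrow. The mask again uses True to mean null.

```python
import numpy as np


def resolve_out_of_bounds(indices, mask, dictionary_size, on_invalid="raise"):
    """Hypothetical helper: return a null mask after applying a policy
    to out-of-bounds, non-null dictionary indices.

    on_invalid="raise" raises IndexError if any such index exists.
    on_invalid="null"  marks those slots as null in the returned mask.
    """
    indices = np.asarray(indices)
    mask = np.asarray(mask, dtype=bool)
    bad = ~mask & ((indices < 0) | (indices >= dictionary_size))
    if not bad.any():
        return mask
    if on_invalid == "raise":
        raise IndexError(
            "%d dictionary indices are out of bounds" % int(bad.sum())
        )
    if on_invalid == "null":
        # Treat every out-of-bounds slot as missing instead of failing.
        return mask | bad
    raise ValueError("on_invalid must be 'raise' or 'null'")
```

Raising is the safer default for data produced inside Arrow, since it surfaces corruption early. Treating the values as null may suit external data where stray codes are expected.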


