Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/02/08 03:08:00 UTC

[jira] [Updated] (ARROW-3806) [Python] When converting nested types to pandas, use tuples

     [ https://issues.apache.org/jira/browse/ARROW-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3806:
--------------------------------
    Fix Version/s: 0.14.0

> [Python] When converting nested types to pandas, use tuples
> -----------------------------------------------------------
>
>                 Key: ARROW-3806
>                 URL: https://issues.apache.org/jira/browse/ARROW-3806
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1
>         Environment: Fedora 29, pyarrow installed with conda
>            Reporter: Suvayu Ali
>            Priority: Minor
>              Labels: pandas
>             Fix For: 0.14.0
>
>
> When converting to pandas, convert nested types (e.g. list) to tuples.  Columns with lists are difficult to query.  Here are a few unsuccessful attempts:
> {code}
> >>> mini
>     CHROM    POS           ID            REF    ALTS  QUAL
> 80     20  63521  rs191905748              G     [A]   100
> 81     20  63541  rs117322527              C     [A]   100
> 82     20  63548  rs541129280              G    [GT]   100
> 83     20  63553  rs536661806              T     [C]   100
> 84     20  63555  rs553463231              T     [C]   100
> 85     20  63559  rs138359120              C     [A]   100
> 86     20  63586  rs545178789              T     [G]   100
> 87     20  63636  rs374311122              G     [A]   100
> 88     20  63696  rs149160003              A     [G]   100
> 89     20  63698  rs544072005              A     [C]   100
> 90     20  63729  rs181483669              G     [A]   100
> 91     20  63733   rs75670495              C     [T]   100
> 92     20  63799    rs1418258              C     [T]   100
> 93     20  63808   rs76004960              G     [C]   100
> 94     20  63813  rs532151719              G     [A]   100
> 95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
> 96     20  63865  rs551938596              G     [A]   100
> 97     20  63902  rs571779099              A     [T]   100
> 98     20  63963  rs531152674              G     [A]   100
> 99     20  63967  rs116770801              A     [G]   100
> 100    20  63977  rs199703510              C     [G]   100
> 101    20  64016  rs143263863              G     [A]   100
> 102    20  64062  rs148297240              G     [A]   100
> 103    20  64139  rs186497980              G  [A, T]   100
> 104    20  64150    rs7274499              C     [A]   100
> 105    20  64151  rs190945171              C     [T]   100
> 106    20  64154  rs537656456              T     [G]   100
> 107    20  64175  rs116531220              A     [G]   100
> 108    20  64186  rs141793347              C     [G]   100
> 109    20  64210  rs182418654              G     [C]   100
> 110    20  64303  rs559929739              C     [A]   100
> {code}
> # I think this one fails because pandas treats the list as array-like and tries to compare it element-wise against the whole column (31 rows vs 2 items).
> {code}
> >>> mini[mini.ALTS == ["A", "T"]]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
>     res = na_op(values, other)
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
>     result = _comp_method_OBJECT_ARRAY(op, x, y)
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
>     result = libops.vec_compare(x, y, op)
>   File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
> ValueError: Arrays were different lengths: 31 vs 2
> {code}
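> # I think this fails for a similar reason, but the broadcasting happens inside each cell: the cells are numpy arrays, so x == ["A", "T"] returns an array of booleans rather than a single bool, and the resulting Series of arrays cannot be used as a row mask.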
> {code}
> >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
>     return self._getitem_array(key)
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
>     indexer = self.loc._convert_to_indexer(key, axis=1)
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
>     indexer = check = labels.get_indexer(objarr)
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
>     indexer = self._engine.get_indexer(target._ndarray_values)
>   File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
> TypeError: unhashable type: 'numpy.ndarray'
> >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
> 80     [True, False]
> 81     [True, False]
> 82    [False, False]
> 83    [False, False]
> 84    [False, False]
> {code}
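> # Unfortunately, wrapping the list in a one-element object array fails as well: the comparison collapses to a single False instead of a per-row mask.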
> {code}
> >>> c = np.empty(1, object)
> >>> c[0] = ["A", "T"]
> >>> mini[mini.ALTS.values == c]
> Traceback (most recent call last):
>   File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
>     return self._engine.get_loc(key)
>   File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
>   File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: False
> >>> mini.ALTS.values == c
> False
> {code}
> Finally, what succeeds is the following (probably because a tuple is immutable and hashable, so it is compared as a single value rather than element-wise):
> {code}
> >>> mini["ALTS2"] = mini.ALTS.apply(tuple)
> >>> mini.head()
>    CHROM    POS           ID REF  ALTS  QUAL  ALTS2
> 80    20  63521  rs191905748   G   [A]   100   (A,)
> 81    20  63541  rs117322527   C   [A]   100   (A,)
> 82    20  63548  rs541129280   G  [GT]   100  (GT,)
> 83    20  63553  rs536661806   T   [C]   100   (C,)
> 84    20  63555  rs553463231   T   [C]   100   (C,)
> >>> mini[mini["ALTS2"] == ("A", "T")]
>     CHROM    POS           ID REF    ALTS  QUAL   ALTS2
> 103    20  64139  rs186497980   G  [A, T]   100  (A, T)
> >>> mini[mini["ALTS2"] == ("GT",)]
>    CHROM    POS           ID REF  ALTS  QUAL  ALTS2
> 82    20  63548  rs541129280   G  [GT]   100  (GT,)
> >>> mini[mini["ALTS2"] == tuple("C")]
>     CHROM    POS           ID            REF ALTS  QUAL ALTS2
> 83     20  63553  rs536661806              T  [C]   100  (C,)
> 84     20  63555  rs553463231              T  [C]   100  (C,)
> 89     20  63698  rs544072005              A  [C]   100  (C,)
> 93     20  63808   rs76004960              G  [C]   100  (C,)
> 95     20  63857  rs543686274  CCTGGAAAGGATT  [C]   100  (C,)
> 109    20  64210  rs182418654              G  [C]   100  (C,)
> {code}
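
For readers who want to try the workaround without the original data, here is a minimal sketch. The three-row frame is a hypothetical stand-in for `mini`, its cells are plain lists rather than the numpy arrays pyarrow produced in the report, and `rows_with_alts` is an illustrative helper name, not part of pandas or pyarrow. It first shows the element-wise comparison behind the failed attempts, then filters on a tuple column with an explicit per-cell check.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `mini`. Cells are plain lists here for brevity;
# the traceback in the report shows pyarrow producing numpy.ndarray cells.
mini = pd.DataFrame({
    "ID": ["rs541129280", "rs536661806", "rs186497980"],
    "ALTS": [["GT"], ["C"], ["A", "T"]],
})

# Why the apply() attempt returns arrays: an ndarray cell compared with a
# list is evaluated element-wise (with broadcasting), not as one value.
cell = np.array(["A"])
elementwise = cell == ["A", "T"]  # an array of two booleans, not a bool

# Convert each cell to a tuple so it compares as a single value.
mini["ALTS2"] = mini["ALTS"].apply(tuple)


def rows_with_alts(frame, alts):
    """Return rows whose ALTS2 tuple equals `alts` (hypothetical helper)."""
    wanted = tuple(alts)
    # One plain bool per row, so the result is a valid boolean mask.
    mask = frame["ALTS2"].map(lambda value: value == wanted)
    return frame[mask]


print(rows_with_alts(mini, ["A", "T"]))
```

One caveat about the last query in the report: tuple("C") works only because "C" is a single character. tuple("GT") yields ("G", "T"), so a multi-character allele needs the literal form ("GT",).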



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)