Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/02/08 03:08:00 UTC
[jira] [Updated] (ARROW-3806) [Python] When converting nested types to pandas, use tuples
[ https://issues.apache.org/jira/browse/ARROW-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-3806:
--------------------------------
Fix Version/s: 0.14.0
> [Python] When converting nested types to pandas, use tuples
> -----------------------------------------------------------
>
> Key: ARROW-3806
> URL: https://issues.apache.org/jira/browse/ARROW-3806
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.11.1
> Environment: Fedora 29, pyarrow installed with conda
> Reporter: Suvayu Ali
> Priority: Minor
> Labels: pandas
> Fix For: 0.14.0
>
>
> When converting to pandas, convert nested types (e.g. list) to tuples. Columns with lists are difficult to query. Here are a few unsuccessful attempts:
> {code}
> >>> mini
> CHROM POS ID REF ALTS QUAL
> 80 20 63521 rs191905748 G [A] 100
> 81 20 63541 rs117322527 C [A] 100
> 82 20 63548 rs541129280 G [GT] 100
> 83 20 63553 rs536661806 T [C] 100
> 84 20 63555 rs553463231 T [C] 100
> 85 20 63559 rs138359120 C [A] 100
> 86 20 63586 rs545178789 T [G] 100
> 87 20 63636 rs374311122 G [A] 100
> 88 20 63696 rs149160003 A [G] 100
> 89 20 63698 rs544072005 A [C] 100
> 90 20 63729 rs181483669 G [A] 100
> 91 20 63733 rs75670495 C [T] 100
> 92 20 63799 rs1418258 C [T] 100
> 93 20 63808 rs76004960 G [C] 100
> 94 20 63813 rs532151719 G [A] 100
> 95 20 63857 rs543686274 CCTGGAAAGGATT [C] 100
> 96 20 63865 rs551938596 G [A] 100
> 97 20 63902 rs571779099 A [T] 100
> 98 20 63963 rs531152674 G [A] 100
> 99 20 63967 rs116770801 A [G] 100
> 100 20 63977 rs199703510 C [G] 100
> 101 20 64016 rs143263863 G [A] 100
> 102 20 64062 rs148297240 G [A] 100
> 103 20 64139 rs186497980 G [A, T] 100
> 104 20 64150 rs7274499 C [A] 100
> 105 20 64151 rs190945171 C [T] 100
> 106 20 64154 rs537656456 T [G] 100
> 107 20 64175 rs116531220 A [G] 100
> 108 20 64186 rs141793347 C [G] 100
> 109 20 64210 rs182418654 G [C] 100
> 110 20 64303 rs559929739 C [A] 100
> {code}
> # I think this one fails because pandas treats the list as a sequence of per-row values and tries to compare it element by element against the 31 rows.
> {code}
> >>> mini[mini.ALTS == ["A", "T"]]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
> res = na_op(values, other)
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
> result = _comp_method_OBJECT_ARRAY(op, x, y)
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
> result = libops.vec_compare(x, y, op)
> File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
> ValueError: Arrays were different lengths: 31 vs 2
> {code}
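A minimal sketch of the failure above, assuming a recent pandas and a hypothetical three-row stand-in for the ALTS column (the names `alts` and `raised` are illustrative, not from the report). The list on the right-hand side is read as one value per row, so the length mismatch raises before any cell is compared:

```python
import pandas as pd

# Hypothetical three-row stand-in for the ALTS column in the report.
alts = pd.Series([["A"], ["A", "T"], ["C"]])

try:
    # The list is aligned against the rows (3 vs 2), not used as a single value.
    alts == ["A", "T"]
    raised = False
except ValueError:
    raised = True

print(raised)
```

The exact error text differs between pandas versions, so the sketch only checks the exception type.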
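> # I think this fails for a similar reason, but the broadcasting happens in a different place: each cell is a numpy array, so the comparison returns an array per row instead of a single boolean, and the result cannot be used as a row mask.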
> {code}
> >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
> return self._getitem_array(key)
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
> indexer = self.loc._convert_to_indexer(key, axis=1)
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
> indexer = check = labels.get_indexer(objarr)
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
> indexer = self._engine.get_indexer(target._ndarray_values)
> File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
> File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
> TypeError: unhashable type: 'numpy.ndarray'
> >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
> 80 [True, False]
> 81 [True, False]
> 82 [False, False]
> 83 [False, False]
> 84 [False, False]
> {code}
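One way around the per-row arrays shown above is to compare plain Python lists, which gives one boolean per row. This is only a sketch: it assumes each cell is a numpy array, as in this report, and the small series below is a hypothetical stand-in built by hand rather than the reporter's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: an object column whose cells are numpy arrays.
cells = np.empty(3, dtype=object)
for i, values in enumerate([["A"], ["A", "T"], ["C"]]):
    cells[i] = np.array(values, dtype=object)
alts = pd.Series(cells)

# list(x) == [...] is an ordinary Python comparison, so it returns a single
# bool per row instead of an elementwise array.
mask = alts.apply(lambda x: list(x) == ["A", "T"])
hits = alts[mask]

print(mask.tolist())
```

Because `mask` is a boolean series, it can be used directly for row selection.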
> # Unfortunately this clever hack fails as well!
> {code}
> >>> c = np.empty(1, object)
> >>> c[0] = ["A", "T"]
> >>> mini[mini.ALTS.values == c]
> Traceback (most recent call last):
> File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
> return self._engine.get_loc(key)
> File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
> File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: False
> >>> mini.ALTS.values == c
> False
> {code}
> Finally, what succeeds is converting the lists to tuples first (probably because a tuple is compared as a single value instead of being broadcast across the rows):
> {code}
> >>> mini["ALTS2"] = mini.ALTS.apply(tuple)
> >>> mini.head()
> CHROM POS ID REF ALTS QUAL ALTS2
> 80 20 63521 rs191905748 G [A] 100 (A,)
> 81 20 63541 rs117322527 C [A] 100 (A,)
> 82 20 63548 rs541129280 G [GT] 100 (GT,)
> 83 20 63553 rs536661806 T [C] 100 (C,)
> 84 20 63555 rs553463231 T [C] 100 (C,)
> >>> mini[mini["ALTS2"] == ("A", "T")]
> CHROM POS ID REF ALTS QUAL ALTS2
> 103 20 64139 rs186497980 G [A, T] 100 (A, T)
> >>> mini[mini["ALTS2"] == ("GT",)]
> CHROM POS ID REF ALTS QUAL ALTS2
> 82 20 63548 rs541129280 G [GT] 100 (GT,)
> >>> mini[mini["ALTS2"] == tuple("C")]
> CHROM POS ID REF ALTS QUAL ALTS2
> 83 20 63553 rs536661806 T [C] 100 (C,)
> 84 20 63555 rs553463231 T [C] 100 (C,)
> 89 20 63698 rs544072005 A [C] 100 (C,)
> 93 20 63808 rs76004960 G [C] 100 (C,)
> 95 20 63857 rs543686274 CCTGGAAAGGATT [C] 100 (C,)
> 109 20 64210 rs182418654 G [C] 100 (C,)
> {code}
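Until the conversion does this itself, the workaround above can be wrapped in a small helper. This is a sketch only: `lists_to_tuples` is a hypothetical name, not a pyarrow or pandas function, and the frame reuses three rows from the report with plain lists in the ALTS column.

```python
import pandas as pd


def lists_to_tuples(df, columns):
    """Return a copy of df with the list-like cells in `columns` turned into tuples."""
    out = df.copy()
    for name in columns:
        out[name] = out[name].apply(tuple)
    return out


# Three rows taken from the report (index 80, 82 and 103).
mini = pd.DataFrame({
    "ID": ["rs191905748", "rs541129280", "rs186497980"],
    "ALTS": [["A"], ["GT"], ["A", "T"]],
})

converted = lists_to_tuples(mini, ["ALTS"])

# Per-row comparison against the target tuple; yields one bool per row.
matches = converted[converted["ALTS"].apply(lambda alts: alts == ("A", "T"))]

print(matches["ID"].tolist())
```

When building the target, write `("GT",)` rather than `tuple("GT")`: the latter splits the string into `("G", "T")`, so `tuple("C")` only works because the string has a single character.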
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)