You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yikun Jiang (Jira)" <ji...@apache.org> on 2022/01/18 05:21:00 UTC
[jira] (SPARK-37930) Fix DataFrame select subset with duplicated columns

    [ https://issues.apache.org/jira/browse/SPARK-37930 ]


    Yikun Jiang deleted comment on SPARK-37930:
    -------------------------------------

was (Author: yikunkero):
https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/frame.py#L12268

FYI:

{code:java}
self.loc[:, ['s1', 's2']]
Out[8]: 
      s1     s2
0  330.0  345.0
1  160.0    0.0
2    NaN   30.0
self.loc[:, ['s1', 's1']]
# raise the issue you mentioned
{code}

So, maybe we need to support loc index for ['s1', 's1']. you can use below test to retrigger:
# works
self.psdf.loc[:, ['s1', 's2']]
# not works
self.psdf.loc[:, ['s1', 's1']]

> Fix DataFrame select subset with duplicated columns
> ---------------------------------------------------
>
>                 Key: SPARK-37930
>                 URL: https://issues.apache.org/jira/browse/SPARK-37930
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Priority: Major
>
> pandas
> {code:java}
> >>> pdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> pdf[['a', 'a']]
>    a  a
> 0  1  1
> 1  2  2
> 2  3  3
> 3  4  4 {code}
> pandas on spark
> {code:java}
> >>> psdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> psdf[['a', 'a']]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12077, in __repr__
>     pdf = self._get_or_create_repr_pandas_cache(max_display_count)
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12068, in _get_or_create_repr_pandas_cache
>     self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
>   File "/u02/spark/python/pyspark/pandas/frame.py", line 12063, in _to_internal_pandas
>     return self._internal.to_pandas_frame
>   File "/u02/spark/python/pyspark/pandas/utils.py", line 576, in wrapped_lazy_property
>     setattr(self, attr_name, fn(self))
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1055, in to_pandas_frame
>     return InternalFrame.restore_index(pdf, **self.arguments_for_restore_index)
>   File "/u02/spark/python/pyspark/pandas/internal.py", line 1156, in restore_index
>     pdf.columns = pd.Index(
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 5500, in __setattr__
>     return object.__setattr__(self, name, value)
>   File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 766, in _set_axis
>     self._mgr.set_axis(axis, labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
>     self._validate_set_axis(axis, new_labels)
>   File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
>     raise ValueError(
> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org