Posted to issues@spark.apache.org by "Yikun Jiang (Jira)" <ji...@apache.org> on 2022/01/18 04:32:00 UTC
[jira] [Comment Edited] (SPARK-37930) Fix DataFrame select subset with duplicated columns
[ https://issues.apache.org/jira/browse/SPARK-37930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477561#comment-17477561 ]
Yikun Jiang edited comment on SPARK-37930 at 1/18/22, 4:31 AM:
---------------------------------------------------------------
https://github.com/apache/spark/blob/df7447bc62052e3d7391ba23d7220fb8c9b923fd/python/pyspark/pandas/frame.py#L12268
FYI:
{code:python}
self.loc[:, ['s1', 's2']]
Out[8]:
      s1     s2
0  330.0  345.0
1  160.0    0.0
2    NaN   30.0
self.loc[:, ['s1', 's1']]
# raises the error reported in this issue
{code}
So we probably need to support loc indexing with duplicated column labels such as ['s1', 's1']. You can use the test below to reproduce it:
# works
self.psdf.loc[:, ['s1', 's2']]
# fails
self.psdf.loc[:, ['s1', 's1']]
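For comparison, plain pandas accepts the duplicated selection. The following is a minimal sketch, assuming a small frame that mirrors the s1/s2 values shown above; the name `pdf` and the frame itself are illustrative and are not the actual test fixture:

```python
import pandas as pd

# Illustrative frame mirroring the s1/s2 values above (an assumption, not the test fixture).
pdf = pd.DataFrame({"s1": [330.0, 160.0, None], "s2": [345.0, 0.0, 30.0]})

# Distinct labels: the case that already works.
distinct = pdf.loc[:, ["s1", "s2"]]

# Duplicated labels: pandas returns the same column twice instead of raising.
duplicated = pdf.loc[:, ["s1", "s1"]]

print(list(duplicated.columns))
print(duplicated.shape)
```

This is the behavior that `self.psdf.loc[:, ['s1', 's1']]` would presumably need to match once the fix is in place.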
> Fix DataFrame select subset with duplicated columns
> ---------------------------------------------------
>
> Key: SPARK-37930
> URL: https://issues.apache.org/jira/browse/SPARK-37930
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: dch nguyen
> Priority: Major
>
> pandas
> {code:java}
> >>> pdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> pdf[['a', 'a']]
>    a  a
> 0  1  1
> 1  2  2
> 2  3  3
> 3  4  4
> {code}
> pandas-on-Spark
> {code:java}
> >>> psdf
>    a
> 0  1
> 1  2
> 2  3
> 3  4
> >>> psdf[['a', 'a']]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/u02/spark/python/pyspark/pandas/frame.py", line 12077, in __repr__
> pdf = self._get_or_create_repr_pandas_cache(max_display_count)
> File "/u02/spark/python/pyspark/pandas/frame.py", line 12068, in _get_or_create_repr_pandas_cache
> self, "_repr_pandas_cache", {n: self.head(n + 1)._to_internal_pandas()}
> File "/u02/spark/python/pyspark/pandas/frame.py", line 12063, in _to_internal_pandas
> return self._internal.to_pandas_frame
> File "/u02/spark/python/pyspark/pandas/utils.py", line 576, in wrapped_lazy_property
> setattr(self, attr_name, fn(self))
> File "/u02/spark/python/pyspark/pandas/internal.py", line 1055, in to_pandas_frame
> return InternalFrame.restore_index(pdf, **self.arguments_for_restore_index)
> File "/u02/spark/python/pyspark/pandas/internal.py", line 1156, in restore_index
> pdf.columns = pd.Index(
> File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 5500, in __setattr__
> return object.__setattr__(self, name, value)
> File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
> File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/generic.py", line 766, in _set_axis
> self._mgr.set_axis(axis, labels)
> File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
> self._validate_set_axis(axis, new_labels)
> File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
> raise ValueError(
> ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements {code}