Posted to issues@spark.apache.org by "Xinrong Meng (Jira)" <ji...@apache.org> on 2022/06/21 22:08:00 UTC

[jira] [Updated] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

     [ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinrong Meng updated SPARK-39550:
---------------------------------
    Description: 

When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64
{code}
Notice that the indexes of the two results differ.

In particular, `value_counts` returns an Index (rather than a MultiIndex), which under the hood is backed by a single Spark column of StructType (rather than multiple Spark columns). When Arrow Execution is enabled, Arrow converts each StructType value to a dictionary, whereas a tuple is expected.
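As a hypothetical illustration (not the actual fix, which belongs in pandas-on-Spark's Arrow conversion path), the dict-shaped index values could be mapped back to tuples, assuming the dictionary keys follow the internal `__index_level_N__` naming shown in the output above:

```python
# Hypothetical helper: convert a dict keyed by Spark's internal index
# column names ('__index_level_0__', '__index_level_1__', ...) back into
# the tuple that a MultiIndex entry is expected to be.
def struct_to_tuple(row_dict):
    # Sort keys by their trailing level number so the tuple preserves
    # the original level order regardless of dict ordering.
    keys = sorted(row_dict, key=lambda k: int(k.strip('_').rsplit('_', 1)[-1]))
    return tuple(row_dict[k] for k in keys)

print(struct_to_tuple({'__index_level_0__': 1, '__index_level_1__': 'a'}))  # (1, 'a')
```

This only illustrates the dict-to-tuple mismatch; the real correction is to have the conversion produce tuples directly.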

 

> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---------------------------------------------------------------
>
>                 Key: SPARK-39550
>                 URL: https://issues.apache.org/jira/browse/SPARK-39550
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Priority: Major
>
>  
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'}    1
> {'__index_level_0__': 2, '__index_level_1__': 'b'}    1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a)    1
> (2, b)    1
> dtype: int64
> {code}
> Notice that the indexes of the two results differ.
> In particular, `value_counts` returns an Index (rather than a MultiIndex), which under the hood is backed by a single Spark column of StructType (rather than multiple Spark columns). When Arrow Execution is enabled, Arrow converts each StructType value to a dictionary, whereas a tuple is expected.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org