Posted to issues@spark.apache.org by "Bryan Cutler (Jira)" <ji...@apache.org> on 2020/01/10 22:43:00 UTC
[jira] [Resolved] (SPARK-22232) Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler resolved SPARK-22232.
----------------------------------
Resolution: Won't Fix
Closing in favor of the fix in SPARK-29748
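For context, SPARK-29748 (shipped in Spark 3.0) removes the alphabetical sorting of fields when a Row is constructed from keyword arguments, so the stored field order matches the call order. A minimal sketch of the behavior difference; the PYSPARK_ROW_FIELD_SORTING_ENABLED toggle is documented in the Spark 3.0 migration guide, and the exact output assumes Spark 3.0+ on Python 3.6+:

{code:none}
from pyspark.sql import Row

# Spark 2.x: fields are sorted alphabetically regardless of call order.
# Row(a="a", c=3.0, b=2)  ->  Row(a='a', b=2, c=3.0)

# Spark 3.0+ (Python 3.6+): fields keep the keyword-argument order.
r = Row(a="a", c=3.0, b=2)
print(r)  # Row(a='a', c=3.0, b=2)

# The legacy sorted behavior can be restored by setting the environment
# variable PYSPARK_ROW_FIELD_SORTING_ENABLED=true before starting PySpark.
{code}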
> Row objects in pyspark created using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.2.0
> Reporter: Bago Amirbekian
> Priority: Major
>
> The fields in a Row object created from a dict (i.e. {{Row(**kwargs)}}) should be accessed by field name, not by position, because {{Row.__new__}} sorts the fields alphabetically by name. It seems this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
>
> def toRow(i):
>     return Row(a="a", c=3.0, b=2)
>
> schema = StructType([
>     # Putting fields in alphabetical order masks the issue
>     StructField("a", StringType(), False),
>     StructField("c", FloatType(), False),
>     StructField("b", IntegerType(), False),
> ])
>
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
>
> # As long as we don't shuffle, things work fine.
> print(rdd.toDF(schema).take(2))
>
> # If we introduce a shuffle, we have issues.
> print(rdd.repartition(3).toDF(schema).take(2))
> {code}
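To see why the schema order matters in the reproduction above: on Spark 2.x, {{Row.__new__}} sorts keyword arguments alphabetically, so the values end up in positions a, b, c regardless of the order they were passed in. A small illustration of that sorting (output assumes Spark 2.x):

{code:none}
from pyspark.sql import Row

r = Row(a="a", c=3.0, b=2)

# Spark 2.x sorts the fields alphabetically by name:
print(r)     # Row(a='a', b=2, c=3.0)

# Positional access follows the sorted order, not the call order,
# which is why a schema declared as (a, c, b) mismatches after a shuffle:
print(r[1])  # 2 (field "b"), not 3.0
{code}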