Posted to issues@spark.apache.org by "Bago Amirbekian (JIRA)" <ji...@apache.org> on 2017/10/10 02:51:00 UTC
[jira] [Created] (SPARK-22232) Row objects in pyspark using the
`Row(**kwargs)` syntax do not get serialized/deserialized properly
Bago Amirbekian created SPARK-22232:
---------------------------------------
Summary: Row objects in pyspark using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
Key: SPARK-22232
URL: https://issues.apache.org/jira/browse/SPARK-22232
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.2.0
Reporter: Bago Amirbekian
The fields in a Row object created from a dict (i.e. `Row(**kwargs)`) should be accessed by field name, not by position, because `Row.__new__` sorts the fields alphabetically by name. This promise does not appear to be honored when these Row objects are shuffled. I've included an example to help reproduce the issue.
```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    # Keyword order here (a, c, b) differs from the sorted field order (a, b, c).
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

# `sc` is assumed to be the SparkContext provided by the pyspark shell.
rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle, things work fine.
print(rdd.toDF(schema).take(2))

# If we introduce a shuffle, we have issues.
print(rdd.repartition(3).toDF(schema).take(2))
```
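For illustration only, here is a minimal plain-Python sketch of the mismatch described above. It does not use Spark; `make_row` is a hypothetical stand-in that only mimics the alphabetical field sorting, and it is not PySpark's actual `Row` implementation. It shows how pairing values with the schema by position differs from pairing them by name when the schema order is (a, c, b).

```python
# Plain-Python sketch; `make_row` is a hypothetical stand-in, not PySpark's Row.
def make_row(**kwargs):
    # Mimic the alphabetical field sorting described in the report.
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)

# Field order of the schema used in the reproduction above.
schema_names = ["a", "c", "b"]

names, values = make_row(a="a", c=3.0, b=2)

# Positional pairing: values are in sorted order (a, b, c),
# but the schema lists (a, c, b), so "c" and "b" get each other's values.
by_position = dict(zip(schema_names, values))

# Name-based pairing: look up each schema field by its name.
by_name = {n: values[names.index(n)] for n in schema_names}

print(by_position)
print(by_name)
```

In this sketch the positional pairing hands the integer to the float field `c` and the float to the integer field `b`, while the name-based pairing keeps each value with its field. Whether the shuffled code path in Spark fails in exactly this way is an assumption, not something the report confirms.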