Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/11/02 07:37:59 UTC

[jira] [Resolved] (SPARK-11868) wrong results returned from dataframe created from Rows without consistent schema on pyspark

     [ https://issues.apache.org/jira/browse/SPARK-11868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-11868.
----------------------------------
    Resolution: Cannot Reproduce

I still can't reproduce this in the master branch. It seems this was resolved somewhere, but it is not obvious what fixed it or when, so I am resolving this as Cannot Reproduce. The behaviour appears to be well defined now anyway.

Please feel free to revert my action if anyone believes this is inappropriate.

> wrong results returned from dataframe created from Rows without consistent schema on pyspark
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11868
>                 URL: https://issues.apache.org/jira/browse/SPARK-11868
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.5.2
>         Environment: pyspark
>            Reporter: Yuval Tanny
>
> When the schema is inconsistent (but is the same for the first 10 rows), it's possible to create a dataframe from dictionaries, and if a key is missing its value is None. But when trying to create a dataframe from the corresponding Rows, we get inconsistent behaviour (wrong values under keys) without an exception. See the example below.
> The problems seem to be:
> 1. Not verifying all rows against the inferred schema.
> 2. In pyspark.sql.types._create_converter, None is set when converting a dictionary and a field does not exist:
> {code}
> return tuple([conv(d.get(name)) for name, conv in zip(names, converters)])
> {code}
> But for Rows, it is simply assumed that the number of fields in the tuple equals the number of fields in the inferred schema, and otherwise values are placed under the wrong keys:
> {code}
> return tuple(conv(v) for v, conv in zip(obj, converters))
> {code}
> Thanks.
> Example:
> {code}
> dicts = [{'1':1,'2':2,'3':3}]*10+[{'1':1,'3':3}]
> rows = [pyspark.sql.Row(**r) for r in dicts]
> rows_rdd = sc.parallelize(rows)
> dicts_rdd = sc.parallelize(dicts)
> rows_df = sqlContext.createDataFrame(rows_rdd)
> dicts_df = sqlContext.createDataFrame(dicts_rdd)
> print(rows_df.select(['2']).collect()[10])
> print(dicts_df.select(['2']).collect()[10])
> {code}
> output:
> {code}
> Row(2=3)
> Row(2=None)
> {code}
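The positional mismatch reported above can be illustrated in plain Python, without Spark. This is a minimal sketch of the two conversion strategies, not Spark's actual converter code; `names`, `short_dict`, and `short_row` are illustrative values mirroring the example:

```python
# Field names inferred from the first rows of the data.
names = ['1', '2', '3']

# A record that is missing field '2'.
short_dict = {'1': 1, '3': 3}
short_row = (1, 3)  # the same record as a bare tuple, Row-style

# Dict-style conversion: missing keys are looked up by name and become None.
dict_converted = tuple(short_dict.get(name) for name in names)

# Row/tuple-style conversion: values are paired with names positionally,
# so the value 3 (really field '3') silently lands under field '2'.
row_converted = tuple(v for v, name in zip(short_row, names))

print(dict_converted)  # (1, None, 3)
print(row_converted)   # (1, 3) -- 3 is now associated with field '2'
```

Zipping the two-element tuple against the three-name schema is what produces `Row(2=3)` in the reported output: the value of field '3' is paired with the name '2', while the dictionary path correctly yields `Row(2=None)`.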



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org