Posted to issues@spark.apache.org by "Sylvain Zimmer (JIRA)" <ji...@apache.org> on 2016/07/25 16:17:20 UTC

[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

    [ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392200#comment-15392200 ] 

Sylvain Zimmer commented on SPARK-16700:
----------------------------------------

I dug into this a bit more: 

{{_verify_type({}, struct_schema)}} was already raising a similar exception in Spark 1.6.2; however, schema validation wasn't enforced at all by {{createDataFrame}}: https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/context.py#L418
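
For reference, here is how to reproduce the failure outside of {{createDataFrame}} (a minimal sketch; {{_verify_type}} is a private helper in {{pyspark.sql.types}}, so it is used here for illustration only):

{code}
from pyspark.sql import types as SparkTypes
from pyspark.sql.types import _verify_type  # private helper in 1.6.x / 2.0.0

struct_schema = SparkTypes.StructType([
    SparkTypes.StructField("id", SparkTypes.LongType())
])

# A tuple passes verification against a StructType...
_verify_type((0,), struct_schema)

# ...but a dict raises:
# TypeError: StructType can not accept object {'id': 0} in type <type 'dict'>
_verify_type({"id": 0}, struct_schema)
{code}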

In 2.0.0, it seems that verification is now performed on every row of the data:
https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L504
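
Roughly, the relevant logic looks like this (a simplified sketch of what {{session.py}} does, not the exact source):

{code}
# Simplified sketch: in 2.0.0, createDataFrame verifies the schema row by row.
def prepare(obj):
    _verify_type(obj, schema)  # raises on dicts when schema is a StructType
    return obj

data = map(prepare, data)  # every row pays the verification cost
{code}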

I think there are 2 issues that should be fixed here:
 - {{_verify_type({}, struct_schema)}} shouldn't raise, because as far as I can tell dicts behave as expected and have their items correctly mapped as struct fields.
 - There should be a way to go back to the 1.6.x behaviour and disable schema verification in {{createDataFrame}}. The {{prepare()}} function is map()'d over all the data coming from Python, which will definitely hurt performance for large datasets and complex schemas. Leaving verification on by default but adding a flag to disable it would be a good solution; without one, users will probably have to implement their own {{createDataFrame}} wrapper like I did (a sketch of one possible workaround follows below).
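
Here is a minimal sketch of such a workaround, reusing the names from the repro script below (the helper name is hypothetical). It sidesteps the dict rejection by converting each dict into a tuple in schema field order; note that per-row verification still runs, so a real flag would avoid that cost entirely:

{code}
# Hypothetical workaround: convert dicts to tuples in schema field order,
# since StructType verification accepts tuples but rejects dicts in 2.0.0.
def create_dataframe_from_dicts(sqlc, rdd, schema):
    field_names = [f.name for f in schema.fields]
    tuples = rdd.map(lambda d: tuple(d.get(name) for name in field_names))
    return sqlc.createDataFrame(tuples, schema)

df = create_dataframe_from_dicts(sqlc, rdd, struct_schema)
{code}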


> StructType doesn't accept Python dicts anymore
> ----------------------------------------------
>
>                 Key: SPARK-16700
>                 URL: https://issues.apache.org/jira/browse/SPARK-16700
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python <dict> type, which is very handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended, but I'd advocate for keeping this behaviour: MapType is probably wasteful when your key names never change, and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
>     SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print(df.collect())
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in type <type 'dict'>
> {code}
> Thanks!


