Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/11/03 23:57:36 UTC

[jira] [Updated] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

     [ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-2870:
------------------------------------
    Target Version/s: 1.3.0  (was: 1.2.0)

> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
>                 Key: SPARK-2870
>                 URL: https://issues.apache.org/jira/browse/SPARK-2870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>            Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. 
> This is very important with semi-structured data like JSON, since individual elements in the data set are free to have different structures. The same field may even hold values of different types in different elements.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you have to infer it by looking at every element, not just the first. The aforementioned {{SQLContext.json...()}} methods do this very well. 
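> For instance, here is a minimal sketch of that merged inference (assuming an existing {{SparkContext}} {{sc}} and {{SQLContext}} {{sqlContext}}):
> {code}
> lines = sc.parallelize(['{"a": 5}', '{"a": "cow"}'])
> table = sqlContext.jsonRDD(lines)
> table.printSchema()
> # root
> #  |-- a: string (nullable = true)
> {code}
> The conflicting value types for {{a}} are reconciled by widening to string, so the inferred schema covers every element.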
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD varies.
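> A quick sketch of the limitation and the current workaround (again assuming {{sc}} and {{sqlContext}}; the sample data is illustrative):
> {code}
> import json
> rdd = sc.parallelize([{"a": 5}, {"a": "cow", "b": 1}])
> # inferSchema() derives the schema from the first element alone, so field "b"
> # is missed entirely and the string value for "a" can fail at query time.
> partial = sqlContext.inferSchema(rdd)
> # Round-tripping through JSON text gets the thorough inference this ticket asks for:
> thorough = sqlContext.jsonRDD(rdd.map(json.dumps))
> {code}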
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole data set (see the sketch below).
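> Putting that together, a minimal end-to-end sketch (assuming {{sc}} and {{sqlContext}}; the path and the {{is_valid()}} predicate are hypothetical):
> {code}
> import json
> raw = sc.textFile("hdfs:///data/events.json")    # one JSON object per line
> dicts = raw.map(json.loads).filter(is_valid)     # RDD of Python dicts, bad elements removed
> # Today's workaround: serialize back to JSON text so jsonRDD() can infer
> # a schema over the whole data set.
> events = sqlContext.jsonRDD(dicts.map(json.dumps))
> events.registerTempTable("events")
> {code}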


