Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2014/08/06 01:09:12 UTC

[jira] [Created] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

Nicholas Chammas created SPARK-2870:
---------------------------------------

             Summary: Thorough schema inference directly on RDDs of Python dictionaries
                 Key: SPARK-2870
                 URL: https://issues.apache.org/jira/browse/SPARK-2870
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
            Reporter: Nicholas Chammas


h4. Background

I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set. 

This is very important with semi-structured data like JSON, since individual elements in the data set are free to have different structures. Fields with the same name may even hold values of different types in different elements.

For example:

{code}
{"a": 5}
{"a": "cow"}
{code}

To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well. 
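
For instance, a minimal sketch, assuming a {{SparkContext}} named {{sc}} and a {{SQLContext}} named {{sqlContext}}:

{code}
rdd = sc.parallelize(['{"a": 5}', '{"a": "cow"}'])
srdd = sqlContext.jsonRDD(rdd)
# The conflicting value types for "a" are reconciled (widened to strings),
# so a single schema covers every record in the data set.
{code}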

h4. Feature Request

What we need is for {{SQLContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this (given an RDD of dictionaries {{rdd_of_dicts}}):


{code}
import json
sqlContext.jsonRDD(rdd_of_dicts.map(json.dumps))
{code}

As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
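
A minimal illustration of the limitation, reusing the {{sc}} and {{sqlContext}} names assumed above:

{code}
rdd = sc.parallelize([{"a": 5}, {"a": "cow"}])
# Only the first dictionary is examined, so "a" gets typed as an integer;
# the record where "a" is a string does not fit the inferred schema.
srdd = sqlContext.inferSchema(rdd)
{code}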

h4. Example Use Case

* You have some JSON text data that you want to analyze using Spark SQL. 
* You would use one of the {{SQLContext.json...()}} methods, but you first need to filter out bad elements, which amounts to some minimal schema validation.
* You deserialize the JSON objects to Python {{dict}}s and filter out the bad ones. You now have an RDD of dictionaries.
* From this RDD, you want a SchemaRDD that captures the schema of the whole data set, as sketched below.
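
A sketch of that workflow under the same assumptions as above; the validator {{is_valid}} and the input path are hypothetical stand-ins:

{code}
import json

# Hypothetical validation predicate; real logic would check required fields.
def is_valid(record):
    return isinstance(record, dict) and "a" in record

dicts = sc.textFile("hdfs:///path/to/data.json") \
          .map(json.loads) \
          .filter(is_valid)

# Desired: infer a schema over the entire RDD of dictionaries.
# Today this requires round-tripping through JSON text:
srdd = sqlContext.jsonRDD(dicts.map(json.dumps))
{code}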


