You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/09/01 00:09:21 UTC
[jira] [Commented] (SPARK-2870) Thorough schema inference directly
on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14116915#comment-14116915 ]
Michael Armbrust commented on SPARK-2870:
-----------------------------------------
Yeah, thanks for pinging me. I've targeted this JIRA for 1.2.
> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. They process JSON text directly and infer a schema that covers the entire source data set.
> This is very important with semi-structured data like JSON since individual elements in the data set are free to have different structures. Matching fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a schema by looking at the whole data set. The aforementioned {{SQLContext.json...()}} methods do this very well.
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. Alternatively, we need a new {{SQLContext}} method that works on RDDs of Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] just looks at the first element in the data set. This won't help much when the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL.
> * You would use one of the {{SQLContext.json...()}} methods, but you need to do some filtering on the data first to remove bad elements--basically, some minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole data set.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org