Posted to user@spark.apache.org by boclair <bo...@gmail.com> on 2014/11/07 21:41:20 UTC
jsonRdd and MapType
I'm loading JSON into Spark to create a SchemaRDD (sqlContext.jsonRDD(..)).
I'd like some of the JSON fields to be a MapType rather than a nested
StructType, since the keys will be very sparse.
For example:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> val jsonRdd = sc.parallelize(Seq(
>   """{"key": "1234", "attributes": {"gender": "m"}}""",
>   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
> val schemaRdd = sqlContext.jsonRDD(jsonRdd)
> schemaRdd.printSchema
root
|-- attributes: struct (nullable = true)
| |-- gender: string (nullable = true)
| |-- location: string (nullable = true)
|-- key: string (nullable = true)
> schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
[[null,nyc],4321])
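When jsonRDD infers a schema, it takes the union of every key seen across all records, so each sparse key becomes its own struct field that is null in most rows. A minimal, Spark-free sketch of that union step (plain Scala, hypothetical names):

```scala
// Sparse records: each map carries only the keys actually present.
val records = Seq(Map("gender" -> "m"), Map("location" -> "nyc"))

// Schema inference takes the union of all keys ever seen...
val allKeys = records.flatMap(_.keys).distinct.sorted

// ...so every row ends up with a slot for every key, mostly empty.
val asRows = records.map(r => allKeys.map(k => r.get(k)))
```

With two records and two disjoint keys this already yields half-empty rows; with thousands of sparse keys the struct becomes almost entirely nulls.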
However, this isn't what I want, so I created my own StructType to pass to
the jsonRDD call:
> import org.apache.spark.sql._
> val st = StructType(Seq(
>   StructField("key", StringType, false),
>   StructField("attributes", MapType(StringType, StringType, false))))
> val jsonRddSt = sc.parallelize(Seq(
>   """{"key": "1234", "attributes": {"gender": "m"}}""",
>   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
> val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
> schemaRddSt.printSchema
root
|-- key: string (nullable = false)
|-- attributes: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = false)
> schemaRddSt.collect
*** Failure ***
scala.MatchError: MapType(StringType,StringType,false) (of class
org.apache.spark.sql.catalyst.types.MapType)
at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
...
The schema of the SchemaRDD is correct, but it seems the JSON cannot
be coerced to a MapType. I can see at the line in the stack trace that
there is no case statement for MapType. Is there something I'm missing? Is
this a bug, or a decision not to support MapType with JSON?
Thanks,
Brian
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: jsonRdd and MapType
Posted by Yin Huai <hu...@gmail.com>.
Hello Brian,
Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add support for it. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.
Thanks,
Yin
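Until that fix lands, one workaround is to accept the inferred struct schema and collapse each sparse struct into a map yourself. A Spark-free sketch of the per-row conversion (structToMap is a hypothetical helper; in Spark 1.x you would apply it inside a .map over the SchemaRDD, pairing each row's values with the field names from the inferred schema):

```scala
// Collapse a sparse struct row (field names plus possibly-null values)
// into a Map that keeps only the entries that are actually present.
def structToMap(fieldNames: Seq[String], values: Seq[Any]): Map[String, Any] =
  fieldNames.zip(values)
    .collect { case (k, v) if v != null => k -> v }
    .toMap
```

For the rows above, structToMap(Seq("gender", "location"), Seq("m", null)) yields Map("gender" -> "m"), i.e. the sparse keys stay sparse.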
On Fri, Nov 7, 2014 at 3:41 PM, boclair <bo...@gmail.com> wrote:
> I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
> I'd like some of the json fields to be in a MapType rather than a sub
> StructType, as the keys will be very sparse.
> [...]
> Is this a bug or decision to not support MapType with json?