You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Justin Pihony <ju...@gmail.com> on 2015/03/06 03:11:10 UTC

SparkSQL JSON array support

Is there any plans of supporting JSON arrays more fully? Take for example:

    val myJson =
sqlContext.jsonRDD(List("""{"foo":[{"bar":1},{"baz":2}]}"""))
    myJson.registerTempTable("JsonTest")

I would like a way to pull out parts of the array data based on a key

    sql("""SELECT foo["bar"] FROM JsonTest""") //projects only the object
with bar, the rest would be null
 
I could even work around this if there was some way to access the key name
from the SchemaRDD:

    myJson.filter(x=>x(0).asInstanceOf[Seq[Row]].exists(y=>y.key == "bar"))
        .map(x=>x(0).asInstanceOf[Seq[Row]].filter(y=>y.key == "bar")) 
    //This does the same as above, except also filtering out those without a
bar key

This is the closest suggestion I could find thus far,
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView> 
which still does not solve the problem of pulling out the keys.

I tried with a UDF also, but could not currently make that work either.

If there isn't anything in the works, then would it be appropriate to create
a ticket for this?

Thanks,
Justin



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-JSON-array-support-tp21939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SparkSQL JSON array support

Posted by Michael Armbrust <mi...@databricks.com>.
You can do want with lateral view explode, but what seems to be missing is
that jsonRDD converts json objects into structs (fixed keys with a fixed
order) and fields in a struct are accessed using a `.`

val myJson = sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}"""
:: Nil))
myJson.registerTempTable("JsonTest")​val result = sql("SELECT f.bar
FROM JsonTest LATERAL VIEW explode(foo) a AS f").collect()

myJson: org.apache.spark.sql.DataFrame = [foo:
array<struct<bar:bigint,baz:bigint>>]
result: Array[org.apache.spark.sql.Row] = Array([1], [null])


In Spark 1.3 you can also hint to jsonRDD that you'd like the json objects
converted into Maps (non-uniform keys) instead of structs, by manually
specifying the schema of your JSON.

import org.apache.spark.sql.types._val schema =
  StructType(
    StructField("foo", ArrayType(MapType(StringType, IntegerType))) :: Nil)
​
sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}"""
:: Nil), schema).registerTempTable("jsonTest")
​val withSql = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo) a
AS a WHERE a['bar'] IS NOT NULL").collect()
​val withSpark = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo)
a AS a").rdd.filter  {
  case Row(a: Map[String, Int]) if a.contains("bar") => true
  case _: Row => false
}.collect()

schema: org.apache.spark.sql.types.StructType =
StructType(StructField(foo,ArrayType(MapType(StringType,IntegerType,true),true),true))
withSql: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])
withSpark: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])


Michael

On Thu, Mar 5, 2015 at 6:11 PM, Justin Pihony <ju...@gmail.com>
wrote:

> Is there any plans of supporting JSON arrays more fully? Take for example:
>
>     val myJson =
> sqlContext.jsonRDD(List("""{"foo":[{"bar":1},{"baz":2}]}"""))
>     myJson.registerTempTable("JsonTest")
>
> I would like a way to pull out parts of the array data based on a key
>
>     sql("""SELECT foo["bar"] FROM JsonTest""") //projects only the object
> with bar, the rest would be null
>
> I could even work around this if there was some way to access the key name
> from the SchemaRDD:
>
>     myJson.filter(x=>x(0).asInstanceOf[Seq[Row]].exists(y=>y.key == "bar"))
>         .map(x=>x(0).asInstanceOf[Seq[Row]].filter(y=>y.key == "bar"))
>     //This does the same as above, except also filtering out those without
> a
> bar key
>
> This is the closest suggestion I could find thus far,
> <
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
> >
> which still does not solve the problem of pulling out the keys.
>
> I tried with a UDF also, but could not currently make that work either.
>
> If there isn't anything in the works, then would it be appropriate to
> create
> a ticket for this?
>
> Thanks,
> Justin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-JSON-array-support-tp21939.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>