Posted to user@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2014/12/11 00:10:41 UTC

Filtering nested data using Spark SQL

Hi spark users,

I'm trying to use Spark SQL to filter a JSON file that has the following
schema:

root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- name: string (nullable = true)

I would like to filter distinct user_id values based on the items they
contain. For instance, I would like to find the distinct user_id values that
have an item whose name equals "apple" in the following 'user' table:

user_id | item
1 | ([1, apple], [1, apple], [2, orange])
2 | ([2, orange])

The result should be 1

I tried using hql:
select user_id from user
lateral view explode(item) itemTable as itemColumn
where itemColumn.name = 'apple'
group by user_id
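
For completeness, this is roughly how I run that query. It is only a sketch:
I'm assuming a HiveContext (the HiveQL parser is needed for "lateral view"),
an existing SparkContext sc, and a hypothetical input file users.json:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Infer the nested schema from the JSON file and register it as the 'user' table.
hiveContext.jsonFile("users.json").registerTempTable("user")

val ids = hiveContext.sql(
  """select user_id from user
    |lateral view explode(item) itemTable as itemColumn
    |where itemColumn.name = 'apple'
    |group by user_id""".stripMargin
).collect()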

This works, but it seems inefficient: ideally the scan of a user's item
array could stop as soon as the first item with name 'apple' is found. The
"lateral view explode" and "group by" also feel unnecessary.

I'm thinking of processing the 'user' table as a SchemaRDD. Ideally, I would
love to do something like this (assuming user is a SchemaRDD):

val ids = user.select('user_id).where(contain('item, "name", "apple")).collect()

The contain function would scan the item array and return true as soon as it
finds an element whose "name" field equals "apple", i.e. with early stopping.
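
One workaround I can imagine is to fall back to SQL with a registered Scala
UDF, since exists does stop at the first match. This is only a rough sketch
of what I mean: I'm assuming the array of structs reaches the UDF as a
Seq[Row], that the struct fields arrive in the order shown in the schema
above (item_id, name), that sc is an existing SparkContext, and that the
input file is called users.json:

import org.apache.spark.sql.{Row, SQLContext}

val sqlContext = new SQLContext(sc)
sqlContext.jsonFile("users.json").registerTempTable("user")

// Hypothetical UDF: returns true as soon as it sees an element whose
// 'name' field (index 1 in the schema above) equals the target name.
sqlContext.registerFunction("containsItem",
  (items: Seq[Row], name: String) =>
    items != null && items.exists(_.getString(1) == name))

val ids = sqlContext.sql(
  "select distinct user_id from user where containsItem(item, 'apple')"
).collect()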

Is something like that possible with the SchemaRDD DSL? If yes, how would one
implement the contain function?

Best Regards,

Jerry