Posted to user@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2014/12/11 00:10:41 UTC
Filtering nested data using Spark SQL
Hi spark users,
I'm trying to filter a json file that has the following schema using Spark
SQL:
root
|-- user_id: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- item_id: string (nullable = true)
| | |-- name: string (nullable = true)
I would like to find the distinct user_ids based on the items they contain. For
instance, I would like to find the distinct user_ids that have an item whose
name equals "apple" in the following 'user' table:
user_id | item
1 | ([1, apple], [1, apple], [2, orange])
2 | ([2, orange])
The result should be 1
I tried using hql:
select user_id from user lateral view explode(item) itemTable as itemColumn
where itemColumn.name = 'apple' group by user_id
but this seems inefficient, since I could stop scanning the item array as soon
as I find the first item with name 'apple'. The "lateral view explode" and
"group by" also feel unnecessary.
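For comparison, here is what the explode/group-by plan computes, modeled in plain Scala over the sample table above. This is not Spark API; Item, users, and viaExplode are hypothetical stand-ins I made up to mirror the schema:

```scala
// Plain-Scala model of the sample 'user' table from the email.
case class Item(itemId: String, name: String)

val users: Seq[(String, Seq[Item])] = Seq(
  "1" -> Seq(Item("1", "apple"), Item("1", "apple"), Item("2", "orange")),
  "2" -> Seq(Item("2", "orange"))
)

// What "lateral view explode ... group by" does: every array element is
// materialized as its own row, matching rows are kept, then user_ids are
// deduplicated. No short-circuiting happens anywhere.
val viaExplode: Seq[String] =
  users
    .flatMap { case (id, items) => items.map(id -> _) } // explode
    .filter { case (_, item) => item.name == "apple" }  // where clause
    .map(_._1)
    .distinct                                           // group by user_id

// viaExplode == Seq("1")
```

Note that both copies of the "apple" item for user 1 are materialized and filtered before the duplicates collapse, which is exactly the wasted work being described.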
I'm thinking of processing the 'user' table as SchemaRDD. Ideally, I would
love to do (assuming user is a SchemaRDD):
val ids = user.select('user_id).where(contain('item, "name", "apple")).collect()
The contain function would loop through the item array looking for an element
with "name" = "apple", stopping early at the first match.
Is this possible? If yes, how would one implement the contain function?
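One possible sketch, not a definitive implementation: the body of contain can be an ordinary Scala exists over the array, which already stops at the first matching element. The Spark registration in the comments is an assumption (in a real UDF the array of structs would likely arrive as Rows rather than case classes); containsItemName and Item are names of my own:

```scala
// Hypothetical struct type mirroring the item element in the schema.
case class Item(itemId: String, name: String)

// Core predicate: exists inspects elements left to right and returns
// as soon as the predicate is true, giving the early stopping wanted.
def containsItemName(items: Seq[Item], target: String): Boolean =
  items != null && items.exists(_.name == target)

// Registered as a SQL function (hypothetical name, sketch only), it
// could replace the explode/group-by query, e.g. roughly:
//   sqlContext.registerFunction("contains_item_name",
//     (items: Seq[Item], t: String) => containsItemName(items, t))
//   sql("SELECT user_id FROM user WHERE contains_item_name(item, 'apple')")

// Demonstrate the short-circuit: count how many names get inspected.
var inspected = 0
val items = Seq(Item("1", "apple"), Item("1", "apple"), Item("2", "orange"))
val found = items.exists { i => inspected += 1; i.name == "apple" }
// found is true after inspecting only the first element (inspected == 1)
```

Whether the optimizer keeps the whole array in one row (so the UDF can short-circuit) depends on the query plan, but at minimum this avoids materializing one row per array element the way explode does.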
Best Regards,
Jerry