Posted to issues@spark.apache.org by "Tycho Grouwstra (JIRA)" <ji...@apache.org> on 2015/10/31 13:51:27 UTC

[jira] [Comment Edited] (SPARK-11431) Allow exploding arrays of structs in DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-11431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983980#comment-14983980 ] 

Tycho Grouwstra edited comment on SPARK-11431 at 10/31/15 12:51 PM:
--------------------------------------------------------------------

Hah, I actually missed that, so my bad, thanks!

What I'm doing now is along the lines of the following (example adapted from [here](http://xinhstechblog.blogspot.com/2015/06/reading-json-data-in-spark-dataframes.html); I only just realized that post was yours):

```
import org.apache.spark.sql.functions._

val json = """{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}"""
val peopleDf = sqlContext.read.json(sc.parallelize(json :: Nil))
peopleDf.select($"*", explode($"schools").as("school")).drop("schools").show()
```
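
For what it's worth, to then flatten the exploded struct's fields into top-level columns I just select through the alias (same column names as in the sample JSON above):

```
// Pull the exploded struct's fields out into their own columns:
peopleDf
  .select($"name", explode($"schools").as("school"))
  .select($"name", $"school.sname", $"school.year")
  .show()
```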

I must say I've felt slightly puzzled by the convention of having to use `explode` inside an enclosing `select` statement, though; as an unwitting user, I'd expect `df.explode($"col")` to do something functionally equivalent to the current `df.select($"*", explode($"col"))` without having to type that out. In my naivety, I'd reason: 'if I wanted to select just a subset of columns, I could always add a `select` to do so myself'.
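
To make concrete what I mean, something like this (purely a hypothetical sketch, nothing in the Spark API; `explodeAll` is a name I'm making up):

```
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical sugar: keep all existing columns and append the exploded one,
// i.e. shorthand for df.select(col("*"), explode(c).as(alias)).
implicit class ExplodeOps(df: DataFrame) {
  def explodeAll(c: Column, alias: String): DataFrame =
    df.select(col("*"), explode(c).as(alias))
}

// peopleDf.explodeAll($"schools", "school") // same rows as the select above
```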

Obviously, changing user-facing APIs is disruptive, and not everyone will have identical expectations, but I'm just curious: was this an artifact of performance considerations, or a deliberate part of a larger philosophy of keeping the syntax as explicit as possible?

Then again, aside from keeping existing columns, a `drop` of the pre-`explode` column would often seem a sensible default to me as well. So, point taken that expectations may differ, in which case defaulting to whatever takes the least processing definitely seems a sane choice.


Something else I've been thinking about is an `explodeZipped` type of function, to explode multiple equally-sized array columns together, as opposed to chaining separate explodes, which yields a Cartesian product. I was still looking into that, but at this point I wonder whether I've overlooked existing functionality for that as well. :)
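
To illustrate, the closest I've gotten is zipping the arrays myself with a UDF before a single `explode` (a sketch assuming a `df` with two equal-length array columns `xs: array<int>` and `ys: array<string>`; `zipArrays` and `zipped` are names I made up):

```
import org.apache.spark.sql.functions.{explode, udf}

// Zip two equal-length array columns into one array of structs, so a
// single explode yields aligned rows instead of a Cartesian product.
val zipArrays = udf((xs: Seq[Int], ys: Seq[String]) => xs.zip(ys))

df.select(explode(zipArrays($"xs", $"ys")).as("zipped"))
  .select($"zipped._1".as("x"), $"zipped._2".as("y"))
  .show()
```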



> Allow exploding arrays of structs in DataFrames
> -----------------------------------------------
>
>                 Key: SPARK-11431
>                 URL: https://issues.apache.org/jira/browse/SPARK-11431
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Tycho Grouwstra
>              Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am creating DataFrames from some [JSON data](http://www.kayak.com/h/explore/api?airport=AMS), and would like to explode an array of structs (as are common in JSON) to their own rows so I could start analyzing the data using GraphX. I believe many others might have use for this as well, since most web data is in JSON format.
> This feature would build upon the existing `explode` functionality added to DataFrames by [~marmbrus], which currently errors when you call it on such arrays of `InternalRow`s. This relates to `explode`'s use of the `schemaFor` function to infer column types -- this approach is insufficient in the case of `Row`s, since their type does not contain the required info. The alternative here would be to instead grab the schema info from the existing schema for such cases.
> I'm trying to implement a patch that might add this functionality, so stay tuned until I've figured that out. I'm new here though so I'll probably have use for some feedback...


