Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2022/05/26 08:27:00 UTC
[jira] [Updated] (SPARK-39292) Make Dataset.melt work with struct fields
[ https://issues.apache.org/jira/browse/SPARK-39292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enrico Minack updated SPARK-39292:
----------------------------------
Description:
In SPARK-38864, the melt function was added to Dataset.
It would be nice if fields of struct columns could also be used as id and value columns, addressed by their dotted path (e.g. {{an.id}}). This would allow for the following.
Given a Dataset with the following schema:
{code:java}
root
 |-- an: struct (nullable = false)
 |    |-- id: integer (nullable = false)
 |-- str: struct (nullable = false)
 |    |-- one: string (nullable = true)
 |    |-- two: string (nullable = true)
{code}
For example, with these rows:
{code:java}
+---+-------------+
| an|          str|
+---+-------------+
|{1}|   {one, One}|
|{2}|  {two, null}|
|{3}|{null, three}|
|{4}| {null, null}|
+---+-------------+
{code}
Melting with value columns {{Seq("str.one", "str.two")}} on id columns {{Seq("an.id")}} would result in:
{code:java}
+--+--------+-----+
|an|variable|value|
+--+--------+-----+
| 1| str.one|  one|
| 1| str.two|  One|
| 2| str.one|  two|
| 2| str.two| null|
| 3| str.one| null|
| 3| str.two|three|
| 4| str.one| null|
| 4| str.two| null|
+--+--------+-----+
{code}
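For comparison, the same result can probably be obtained today without this change by exploding an array of (variable, value) structs. The following is only a sketch under stated assumptions: it needs {{spark-sql}} on the classpath, {{MeltStructSketch}} and {{meltStructFields}} are hypothetical names that are not part of Spark, and all value fields must share one type (here string). It is not the implementation proposed in this issue.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object MeltStructSketch {

  // Hypothetical helper (not a Spark API): melts nested fields given by dotted paths.
  // idPath is the nested id field, idName the name of the resulting id column.
  def meltStructFields(
      df: DataFrame,
      idPath: String,
      idName: String,
      valuePaths: Seq[String]): DataFrame = {
    // One struct per value column: (variable = dotted path, value = field content).
    // array() requires all structs to have the same type, so the value fields
    // must share a data type.
    val pairs = valuePaths.map { path =>
      struct(lit(path).as("variable"), col(path).as("value"))
    }
    df.select(col(idPath).as(idName), explode(array(pairs: _*)).as("kv"))
      .select(col(idName), col("kv.variable"), col("kv.value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("melt-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Rebuild the example rows from the description, then nest them into structs.
    val df = Seq[(Int, Option[String], Option[String])](
      (1, Some("one"), Some("One")),
      (2, Some("two"), None),
      (3, None, Some("three")),
      (4, None, None)
    ).toDF("id", "str1", "str2").select(
      struct($"id").as("an"),
      struct($"str1".as("one"), $"str2".as("two")).as("str")
    )

    meltStructFields(df, "an.id", "an", Seq("str.one", "str.two")).show()
    spark.stop()
  }
}
{code}
The improvement requested here would make such a detour unnecessary, since the dotted paths could be passed to melt directly.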
See test in {{org.apache.spark.sql.MeltSuite}}:
{code:java}
test("SPARK-39292: melt with struct fields") {
  // Nest the flat columns into two structs: an.id and str.{one, two}
  val df = meltWideDataDs.select(
    struct($"id").as("an"),
    struct(
      $"str1".as("one"),
      $"str2".as("two")
    ).as("str")
  )

  checkAnswer(
    Melt.of(df, Seq("an.id"), Seq("str.one", "str.two"), false, "variable", "value"),
    // Same rows as the flat melt, with variable names mapped to the nested field paths
    meltedWideDataRows.map(row => Row(
      row.getInt(0),
      row.getString(1) match {
        case "str1" => "str.one"
        case "str2" => "str.two"
      },
      row.getString(2)
    ))
  )
}
{code}
was:
The same description, schema, and example tables as above; only the test in {{org.apache.spark.sql.MeltSuite}} differed:
{code:java}
test("melt with struct fields") {
  val df = meltWideDataDs.select(
    struct($"id").as("an"),
    struct(
      $"str1".as("one"),
      $"str2".as("two")
    ).as("str")
  )

  checkAnswer(
    Melt.of(df, Seq("an.id"), Seq("str.one", "str.two")),
    meltedWideDataRows.map(row => Row(
      row.getInt(0),
      row.getString(1) match {
        case "str1" => "str.one"
        case "str2" => "str.two"
      },
      row.getString(2)
    ))
  )
}
{code}
> Make Dataset.melt work with struct fields
> -----------------------------------------
>
> Key: SPARK-39292
> URL: https://issues.apache.org/jira/browse/SPARK-39292
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Enrico Minack
> Priority: Major