Posted to issues@spark.apache.org by "Enrico Minack (Jira)" <ji...@apache.org> on 2022/05/26 08:27:00 UTC

[jira] [Updated] (SPARK-39292) Make Dataset.melt work with struct fields

     [ https://issues.apache.org/jira/browse/SPARK-39292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrico Minack updated SPARK-39292:
----------------------------------
    Description: 
In SPARK-38864, the melt function was added to Dataset.

It would be nice if fields of struct columns could be used as id and value columns. This would allow for the following:

Given a Dataset with the following schema:
{code:java}
root
 |-- an: struct (nullable = false)
 |    |-- id: integer (nullable = false)
 |-- str: struct (nullable = false)
 |    |-- one: string (nullable = true)
 |    |-- two: string (nullable = true)
{code}

For example:
{code:java}
+---+-------------+
| an|          str|
+---+-------------+
|{1}|   {one, One}|
|{2}|  {two, null}|
|{3}|{null, three}|
|{4}| {null, null}|
+---+-------------+
{code}
Melting with value columns {{Seq("str.one", "str.two")}} on id columns {{Seq("an.id")}} would result in
{code:java}
+--+--------+-----+
|an|variable|value|
+--+--------+-----+
| 1| str.one|  one|
| 1| str.two|  One|
| 2| str.one|  two|
| 2| str.two| null|
| 3| str.one| null|
| 3| str.two|three|
| 4| str.one| null|
| 4| str.two| null|
+--+--------+-----+
{code}
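The melt semantics above can be sketched on plain Scala collections (illustration only, not the Spark API; `InputRow` and `melt` here are hypothetical names): every input row produces one output row per value column, pairing the id with the column's name and its value.

```scala
// Minimal sketch of melt semantics on plain Scala collections (not the Spark API):
// each input row yields one output row per value column.
case class InputRow(id: Int, values: Map[String, Option[String]])

def melt(rows: Seq[InputRow], valueCols: Seq[String]): Seq[(Int, String, Option[String])] =
  for {
    row <- rows        // one pass over the input rows
    col <- valueCols   // one output row per value column
  } yield (row.id, col, row.values.getOrElse(col, None))

val rows = Seq(
  InputRow(1, Map("str.one" -> Some("one"), "str.two" -> Some("One"))),
  InputRow(2, Map("str.one" -> Some("two"), "str.two" -> None)),
  InputRow(3, Map("str.one" -> None,        "str.two" -> Some("three")))
)

val melted = melt(rows, Seq("str.one", "str.two"))
// 3 input rows x 2 value columns = 6 melted rows, matching the table above, e.g.
// (1, "str.one", Some("one")), (1, "str.two", Some("One")), ..., (3, "str.two", Some("three"))
```

With struct-field support, the value-column names ("str.one", "str.two") would carry the dotted path into the {{variable}} column, exactly as in the table above.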

See test in {{org.apache.spark.sql.MeltSuite}}:
{code:java}
  test("SPARK-39292: melt with struct fields") {
    val df = meltWideDataDs.select(
      struct($"id").as("an"),
      struct(
        $"str1".as("one"),
        $"str2".as("two")
      ).as("str")
    )

    checkAnswer(
      Melt.of(df, Seq("an.id"), Seq("str.one", "str.two"), false, "variable", "value"),
      meltedWideDataRows.map(row => Row(
        row.getInt(0),
        row.getString(1) match {
          case "str1" => "str.one"
          case "str2" => "str.two"
        },
        row.getString(2)
      ))
    )
  }
{code}

  was:
In SPARK-38864, the melt function was added to Dataset.

It would be nice if fields of struct columns could be used as id and value columns. This would allow for the following:

Given a Dataset with the following schema:
{code:java}
root
 |-- an: struct (nullable = false)
 |    |-- id: integer (nullable = false)
 |-- str: struct (nullable = false)
 |    |-- one: string (nullable = true)
 |    |-- two: string (nullable = true)
{code}

For example:
{code:java}
+---+-------------+
| an|          str|
+---+-------------+
|{1}|   {one, One}|
|{2}|  {two, null}|
|{3}|{null, three}|
|{4}| {null, null}|
+---+-------------+
{code}
Melting with value columns {{Seq("str.one", "str.two")}} on id columns {{Seq("an.id")}} would result in
{code:java}
+--+--------+-----+
|an|variable|value|
+--+--------+-----+
| 1| str.one|  one|
| 1| str.two|  One|
| 2| str.one|  two|
| 2| str.two| null|
| 3| str.one| null|
| 3| str.two|three|
| 4| str.one| null|
| 4| str.two| null|
+--+--------+-----+
{code}

See test in {{org.apache.spark.sql.MeltSuite}}:
{code:java}
  test("melt with struct fields") {
    val df = meltWideDataDs.select(
      struct($"id").as("an"),
      struct(
        $"str1".as("one"),
        $"str2".as("two")
      ).as("str")
    )

    checkAnswer(
      Melt.of(df, Seq("an.id"), Seq("str.one", "str.two")),
      meltedWideDataRows.map(row => Row(
        row.getInt(0),
        row.getString(1) match {
          case "str1" => "str.one"
          case "str2" => "str.two"
        },
        row.getString(2)
      ))
    )
  }
{code}


> Make Dataset.melt work with struct fields
> -----------------------------------------
>
>                 Key: SPARK-39292
>                 URL: https://issues.apache.org/jira/browse/SPARK-39292
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org