
Posted to issues@spark.apache.org by "Jeff Evans (Jira)" <ji...@apache.org> on 2020/05/20 21:24:00 UTC

[jira] [Created] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

Jeff Evans created SPARK-31779:
----------------------------------

             Summary: Redefining struct inside array incorrectly wraps child fields in array
                 Key: SPARK-31779
                 URL: https://issues.apache.org/jira/browse/SPARK-31779
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.5
            Reporter: Jeff Evans


It seems that redefining a {{struct}} in order to remove a sub-field, when that {{struct}} is itself inside an {{array}}, results in each remaining (non-removed) {{struct}} field being incorrectly wrapped in an array.

For more context, see [this|https://stackoverflow.com/a/46084983/375670] Stack Overflow answer and its discussion thread.  I debugged that code and distilled it down to the example below, which I believe demonstrates a bug in Spark itself.

Consider the following {{spark-shell}} session (version 2.4.5):

{code}
// use a nested JSON structure that contains a struct inside an array
val jsonData = """{
  "foo": "bar",
  "top": {
    "child1": 5,
    "child2": [
      {
        "child2First": "one",
        "child2Second": 2
      }
    ]
  }
}"""

// read into a DataFrame
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())

// create a new definition for "top", which will remove the "top.child2.child2First" column
val newTop = struct(
  df("top").getField("child1").alias("child1"),
  array(
    struct(
      df("top").getField("child2").getField("child2Second").alias("child2Second")
    )
  ).alias("child2")
)

// show the schema before and after swapping out the struct definition
df.schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
df.withColumn("top", newTop).schema.toDDL
// `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
{code}

Notice in this case that the new definition for {{top.child2.child2Second}} is an {{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  There is nothing in the definition of the {{newTop}} {{struct}} that should have caused the type to become wrapped in an array like this.
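
One possible way to narrow down where the extra array comes from is to look at the type of the nested field extraction on its own, before it is passed to {{array}} or {{struct}}. The following is only a diagnostic sketch that continues the same {{spark-shell}} session (it relies on the {{df}} defined above); the names {{extracted}} and {{extractedOnly}} are illustrative and not part of the original reproduction.

{code}
// isolate the expression used inside newTop for the remaining field
val extracted = df("top").getField("child2").getField("child2Second")

// select only that expression and print the resulting schema, to compare
// its type against the BIGINT declared for child2Second in the source schema
val extractedOnly = df.select(extracted.alias("extracted"))
extractedOnly.schema.toDDL
{code}

If the type reported here already differs from {{BIGINT}}, the wrapping would be happening in the field extraction itself rather than in the {{array(struct(...))}} redefinition.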





