Posted to issues@spark.apache.org by "Pablo Langa Blanco (Jira)" <ji...@apache.org> on 2020/05/23 16:17:00 UTC

[jira] [Commented] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array

    [ https://issues.apache.org/jira/browse/SPARK-31779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114866#comment-17114866 ] 

Pablo Langa Blanco commented on SPARK-31779:
--------------------------------------------

Hi [~jeff.w.evans] 

I don't think this is a bug; I believe this is the correct behavior.

Your input has this structure:

 
{code:java}
root
 |-- foo: string (nullable = true)
 |-- top: struct (nullable = true)
 |    |-- child1: long (nullable = true)
 |    |-- child2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- child2First: string (nullable = true)
 |    |    |    |-- child2Second: long (nullable = true)
{code}
When you call df("top").getField("child2"), you get an array whose elements are structs:

 
{code:java}
child2: array (nullable = true)
    |-- element: struct (containsNull = true)
    |    |-- child2First: string (nullable = true)
    |    |-- child2Second: long (nullable = true)
{code}
When you call .getField("child2Second") on this structure, you are accessing a field of the inner struct. Because that struct is inside an array, the result is an array containing the selected field from each element:

 
{code:java}
child2: array (nullable = true)
    |-- element: long (containsNull = true)
{code}
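
The same idea can be sketched in plain Scala, without Spark. The case classes Child2 and Top below are hypothetical stand-ins for the schema above, not Spark types; the point is only that taking a field from a collection of structs amounts to mapping over its elements, so the result is again a collection.

```scala
// Plain-Scala model of the schema above (hypothetical case classes, no Spark involved).
final case class Child2(child2First: String, child2Second: Long)
final case class Top(child1: Long, child2: Seq[Child2])

val top = Top(5L, Seq(Child2("one", 2L), Child2("xxxx", 3L)))

// Analogue of df("top").getField("child2").getField("child2Second"):
// the field is taken from every element, so the result is a sequence of longs.
val child2Second: Seq[Long] = top.child2.map(_.child2Second)

println(child2Second) // List(2, 3)
```

The snippet can be pasted into spark-shell or a plain Scala REPL. Wrapping child2Second in a new struct and a new array, as the report does, would therefore nest the whole sequence one level deeper rather than rebuild each element.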
 

I think this example makes it clearer:

 
{code:java}
// same input as in the report, with a second element added to child2
val jsonData = """{
  "foo": "bar",
  "top": {
    "child1": 5,
    "child2": [
      {
        "child2First": "one",
        "child2Second": 2
      },
      {
        "child2First": "xxxx",
        "child2Second": 3
      }
    ]
  }
}"""
val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
// getField on an array of structs extracts the field from every element
val newTop = df("top").getField("child2").getField("child2Second")
df.withColumn("top2", newTop).printSchema
// root
//  |-- foo: string (nullable = true)
//  |-- top: struct (nullable = true)
//  |    |-- child1: long (nullable = true)
//  |    |-- child2: array (nullable = true)
//  |    |    |-- element: struct (containsNull = true)
//  |    |    |    |-- child2First: string (nullable = true)
//  |    |    |    |-- child2Second: long (nullable = true)
//  |-- top2: array (nullable = true)
//  |    |-- element: long (containsNull = true)
df.withColumn("top2", newTop).show(truncate=false)
// +---+--------------------------+------+
// |foo|top                       |top2  |
// +---+--------------------------+------+
// |bar|[5, [[one, 2], [xxxx, 3]]]|[2, 3]|
// +---+--------------------------+------+
{code}
 

Looking at the code, this behavior is explicitly designed that way, so if you agree, I think we should close the issue.
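
If the goal is to drop child2First from every element while keeping child2 as an array of structs, one possible approach is to rebuild each element with the SQL higher-order function transform. This is only a sketch, assuming Spark 2.4 or later (where transform is available in SQL expressions); the names session, input, prunedChild2, prunedTop and result are hypothetical and not part of the report.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, expr, struct}

// In spark-shell a session already exists and getOrCreate() reuses it.
val session = SparkSession.builder().master("local[1]").appName("spark-31779-sketch").getOrCreate()

// Same data as the example above, written on one line.
val json = """{"foo":"bar","top":{"child1":5,"child2":[{"child2First":"one","child2Second":2},{"child2First":"xxxx","child2Second":3}]}}"""
val input = session.read.json(session.createDataset(Seq(json))(Encoders.STRING))

// Rebuild every element of top.child2 so that it keeps only child2Second.
val prunedChild2 = expr("transform(top.child2, x -> named_struct('child2Second', x.child2Second))")

// New definition of "top": child1 unchanged, child2 with the pruned elements.
val prunedTop = struct(col("top.child1").alias("child1"), prunedChild2.alias("child2"))

val result = input.withColumn("top", prunedTop)
result.printSchema()
```

Because the struct is rebuilt inside the lambda, once per element, child2Second stays a scalar field of each element instead of being collected into an array first.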

> Redefining struct inside array incorrectly wraps child fields in array
> ----------------------------------------------------------------------
>
>                 Key: SPARK-31779
>                 URL: https://issues.apache.org/jira/browse/SPARK-31779
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Jeff Evans
>            Priority: Major
>
> It seems that redefining a {{struct}} for the purpose of removing a sub-field, when that {{struct}} is itself inside an {{array}}, results in the remaining (non-removed) {{struct}} fields themselves being incorrectly wrapped in an array.
> For more context, see [this|https://stackoverflow.com/a/46084983/375670] StackOverflow answer and discussion thread.  I have debugged this code and distilled it down to what I believe represents a bug in Spark itself.
> Consider the following {{spark-shell}} session (version 2.4.5):
> {code}
> // use a nested JSON structure that contains a struct inside an array
> val jsonData = """{
>   "foo": "bar",
>   "top": {
>     "child1": 5,
>     "child2": [
>       {
>         "child2First": "one",
>         "child2Second": 2
>       }
>     ]
>   }
> }"""
> // read into a DataFrame
> val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS())
> // create a new definition for "top", which will remove the "top.child2.child2First" column
> val newTop = struct(df("top").getField("child1").alias("child1"), array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2"))
> // show the schema before and after swapping out the struct definition
> df.schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>>
> df.withColumn("top", newTop).schema.toDDL
> // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>>
> {code}
> Notice in this case that the new definition for {{top.child2.child2Second}} is an {{ARRAY<BIGINT>}}.  This is incorrect; it should simply be {{BIGINT}}.  There is nothing in the definition of the {{newTop}} {{struct}} that should have caused the type to become wrapped in an array like this.


