Posted to issues@spark.apache.org by "Wenchen Fan (Jira)" <ji...@apache.org> on 2021/08/24 15:35:00 UTC

[jira] [Updated] (SPARK-35876) array_zip unexpected column names

     [ https://issues.apache.org/jira/browse/SPARK-35876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-35876:
--------------------------------
    Fix Version/s: 3.0.4
                   3.1.3

> array_zip unexpected column names
> ---------------------------------
>
>                 Key: SPARK-35876
>                 URL: https://issues.apache.org/jira/browse/SPARK-35876
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2
>            Reporter: Derk Crezee
>            Assignee: Kousuke Saruta
>            Priority: Major
>             Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> When I use the arrays_zip function in combination with renamed columns, an unexpected schema is written to disk.
> {code:python}
> from pyspark.sql import Row, SparkSession
> from pyspark.sql.functions import arrays_zip, col
> spark = SparkSession.builder.getOrCreate()
> data = [
>   Row(a1=["a", "a"], b1=["b", "b"]),
> ]
> df = (
>   spark.sparkContext.parallelize(data).toDF()
>     .withColumnRenamed("a1", "a2")
>     .withColumnRenamed("b1", "b2")
>     .withColumn("zipped", arrays_zip(col("a2"), col("b2")))
> )
> df.printSchema()
> # In-memory schema: the zipped struct fields use the renamed columns (a2, b2).
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = false)
> #  |    |    |-- a2: string (nullable = true)
> #  |    |    |-- b2: string (nullable = true)
> df.write.save("test.parquet")
> spark.read.load("test.parquet").printSchema()
> # Schema read back from disk: the zipped struct fields use the original names (a1, b1).
> # root
> #  |-- a2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- b2: array (nullable = true)
> #  |    |-- element: string (containsNull = true)
> #  |-- zipped: array (nullable = true)
> #  |    |-- element: struct (containsNull = true)
> #  |    |    |-- a1: string (nullable = true)
> #  |    |    |-- b1: string (nullable = true)
> {code}
> I would expect the schema of the DataFrame written to disk to match the one printed before the write. Instead, the struct fields of the zipped column use the original column names (a1, b1) rather than the renamed ones (a2, b2).
>  
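
The mismatch described above can be detected mechanically by comparing nested field names before and after the write. The following is a minimal sketch, not part of the original report: it works on plain dicts hand-written to mirror the shape that `df.schema.jsonValue()` returns in PySpark, so it runs without a Spark session. The helper names (`struct_field_paths`, `make_schema`) are hypothetical, and map types are deliberately not handled.

```python
def struct_field_paths(data_type, prefix=""):
    """Collect dotted paths of every struct field nested inside a schema dict."""
    paths = []
    if isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "struct":
            for field in data_type["fields"]:
                path = prefix + field["name"]
                paths.append(path)
                paths.extend(struct_field_paths(field["type"], path + "."))
        elif kind == "array":
            # Arrays add no name of their own; descend into the element type.
            paths.extend(struct_field_paths(data_type["elementType"], prefix))
    # Simple types such as "string" are plain strings and contribute nothing.
    return paths


def make_field(name, data_type):
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}


def make_schema(zipped_first, zipped_second):
    """Build a schema dict shaped like the one in the report (assumed layout)."""
    string_array = {"type": "array", "elementType": "string", "containsNull": True}
    zipped_struct = {
        "type": "struct",
        "fields": [make_field(zipped_first, "string"), make_field(zipped_second, "string")],
    }
    zipped_array = {"type": "array", "elementType": zipped_struct, "containsNull": True}
    return {
        "type": "struct",
        "fields": [
            make_field("a2", string_array),
            make_field("b2", string_array),
            make_field("zipped", zipped_array),
        ],
    }


in_memory = make_schema("a2", "b2")  # what df.printSchema() showed
on_disk = make_schema("a1", "b1")    # what was read back from Parquet

mismatches = [
    (expected, actual)
    for expected, actual in zip(struct_field_paths(in_memory), struct_field_paths(on_disk))
    if expected != actual
]
print(mismatches)
```

With a live session, the same comparison could presumably be run on `df.schema.jsonValue()` and the `jsonValue()` of the schema read back from disk.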


