Posted to issues@spark.apache.org by "Marcelo Vanzin (JIRA)" <ji...@apache.org> on 2019/02/12 22:25:00 UTC

[jira] [Updated] (SPARK-26240) [pyspark] Updating illegal column names with withColumnRenamed does not change the schema, causing pyspark.sql.utils.AnalysisException

     [ https://issues.apache.org/jira/browse/SPARK-26240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-26240:
-----------------------------------
    Component/s:     (was: Spark Core)
                 SQL

> [pyspark] Updating illegal column names with withColumnRenamed does not change the schema, causing pyspark.sql.utils.AnalysisException
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26240
>                 URL: https://issues.apache.org/jira/browse/SPARK-26240
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>         Environment: Ubuntu 16.04 LTS (x86_64/deb)
>  
>            Reporter: Ying Wang
>            Priority: Major
>
> I am unfamiliar with the internals of Spark, but when I ingested a Parquet file with illegal column headers, called df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME), and then called df.show(), pyspark errored out with the failed attribute still being the old column name.
> Steps to reproduce:
> - Create a Parquet file from Pandas using this dataframe schema:
> ```python
> In [10]: df.info()
> <class 'pandas.core.frame.DataFrame'>
> Int64Index: 1000 entries, 0 to 999
> Data columns (total 16 columns):
> Record_ID 1000 non-null int64
> registration_dttm 1000 non-null object
> id 1000 non-null int64
> first_name 984 non-null object
> last_name 1000 non-null object
> email 984 non-null object
> gender 933 non-null object
> ip_address 1000 non-null object
> cc 709 non-null float64
> country 1000 non-null object
> birthdate 803 non-null object
> salary 932 non-null float64
> title 803 non-null object
> comments 179 non-null object
> Unnamed: 14 10 non-null object
> Unnamed: 15 9 non-null object
> dtypes: float64(2), int64(2), object(12)
> memory usage: 132.8+ KB
> ```
> - Open a pyspark shell with `pyspark` and read in the Parquet file with `spark.read.format('parquet').load('/path/to/file.parquet')`.
> - Call `spark_df.show()` and note the error for column 'Unnamed: 14'.
> - Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`
> - Call `spark_df.show()` again, and note that the error message still references attribute 'Unnamed: 14':
> ```python
> >>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
> >>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
> >>> newdf.show()
> Traceback (most recent call last):
>  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
>  return f(*a, **kw)
>  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
> : org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> ...
> ```
> I would have thought there would be a way to read in Parquet files such that illegal column names could be changed after the Spark dataframe was generated, so this looks like unintended behavior to me. Please let me know if I am wrong.
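
For context: the error message above shows that Spark rejects Parquet attribute names containing any of " ,;{}()\n\t=", and because withColumnRenamed only layers a projection on top of the scan, the underlying attribute keeps its original (illegal) name when the action runs. A minimal sketch of one possible pandas-side workaround, assuming the file can be regenerated (the paths and sample columns below are hypothetical, not from the report):

```python
import re
import pandas as pd

# Hypothetical stand-in for the reporter's dataframe; 'Unnamed: 14'
# contains a space, which Spark's Parquet reader rejects.
df = pd.DataFrame({"Unnamed: 14": [1, 2], "salary": [50000.0, 60000.0]})

# Replace every character in Spark's disallowed set with an underscore
# *before* the Parquet file is written, so the stored field names are legal.
df.columns = [re.sub(r'[ ,;{}()\n\t=]', '_', c) for c in df.columns]

# df.to_parquet('/tmp/userdata_clean.parquet')  # requires pyarrow or fastparquet
```

Note the colon is not in the disallowed set, so 'Unnamed: 14' becomes 'Unnamed:_14' (only the space is replaced), matching the rename the reporter attempted on the Spark side.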



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org