Posted to issues@spark.apache.org by "Ying Wang (JIRA)" <ji...@apache.org> on 2018/11/30 22:31:00 UTC

[jira] [Created] (SPARK-26240) [pyspark] Updating illegal column names with withColumnRenamed does not update the schema, causing pyspark.sql.utils.AnalysisException

Ying Wang created SPARK-26240:
---------------------------------

             Summary: [pyspark] Updating illegal column names with withColumnRenamed does not update the schema, causing pyspark.sql.utils.AnalysisException
                 Key: SPARK-26240
                 URL: https://issues.apache.org/jira/browse/SPARK-26240
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.1
         Environment: Ubuntu 16.04 LTS (x86_64/deb)

 
            Reporter: Ying Wang


I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet file with illegal column headers. After calling df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then df.show(), pyspark errored out, with the failed attribute in the error message still being the old column name.

Steps to reproduce:

- Create a Parquet file from Pandas using this dataframe schema:

```python

In [10]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 16 columns):
Record_ID 1000 non-null int64
registration_dttm 1000 non-null object
id 1000 non-null int64
first_name 984 non-null object
last_name 1000 non-null object
email 984 non-null object
gender 933 non-null object
ip_address 1000 non-null object
cc 709 non-null float64
country 1000 non-null object
birthdate 803 non-null object
salary 932 non-null float64
title 803 non-null object
comments 179 non-null object
Unnamed: 14 10 non-null object
Unnamed: 15 9 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 132.8+ KB

```
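For reference, a file like this can be reproduced without the original data. A minimal sketch in pandas (the column values and output path are made up for illustration; the actual write requires a Parquet engine such as pyarrow, so it is guarded here):

```python
import pandas as pd

# Build a small frame whose trailing columns carry the auto-generated
# "Unnamed: N" names pandas assigns to header-less CSV columns. The
# space in those names is what Spark's Parquet reader rejects.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Amanda", "Albert", None],
    "Unnamed: 14": [None, "x", None],
    "Unnamed: 15": [None, None, "y"],
})

try:
    # Requires pyarrow or fastparquet; skip quietly if neither is installed.
    df.to_parquet("/tmp/userdata_repro.parquet")
except ImportError:
    pass
```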
- Open a pyspark shell with `pyspark` and read in the Parquet file with `spark.read.format('parquet').load('/path/to/file.parquet')`
- Call `spark_df.show()` and note the error with column 'Unnamed: 14'.
- Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`
- Call `spark_df.show()` again, and note that the error message still references attribute 'Unnamed: 14':

```python

>>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
>>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
>>> newdf.show()
Traceback (most recent call last):
 File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
 return f(*a, **kw)
 File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
: org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

...

```

I would have thought there would be a way to read in Parquet files such that illegal column names could be changed after the Spark dataframe was generated, so this looks like unintended behavior. Please let me know if I am wrong.
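Until that works, the workaround I have found is to sanitize the column names before the Parquet file is ever written (e.g. in pandas, then re-write the file). A sketch of a sanitizer built from the character set quoted in the AnalysisException text; the helper name is mine, not a Spark or pandas API:

```python
import re

# Characters Spark rejects in Parquet attribute names, per the
# error message: " ,;{}()\n\t="
INVALID = re.compile(r'[ ,;{}()\n\t=]')

def sanitize(name: str) -> str:
    """Replace every Spark-illegal character with an underscore."""
    return INVALID.sub('_', name)

# In pandas, before df.to_parquet(...):
#     df = df.rename(columns=sanitize)
print(sanitize('Unnamed: 14'))  # -> 'Unnamed:_14'
```

Note the colon is not in the illegal set, so 'Unnamed:_14' is accepted; only the space needs replacing.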



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)