Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:16:59 UTC

[jira] [Resolved] (SPARK-21107) Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8

     [ https://issues.apache.org/jira/browse/SPARK-21107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-21107.
----------------------------------
    Resolution: Incomplete

> Pyspark: ISO-8859-1 column names inconsistently converted to UTF-8
> ------------------------------------------------------------------
>
>                 Key: SPARK-21107
>                 URL: https://issues.apache.org/jira/browse/SPARK-21107
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0
>         Environment: Windows 7 standalone
>            Reporter: Tavis Barr
>            Priority: Minor
>              Labels: bulk-closed
>
> When I create column names containing ISO-8859-1 (or, I suspect, possibly other non-UTF-8) characters, they are sometimes converted to UTF-8 and sometimes not.
> Examples:
> >>> df = sc.parallelize([[1,2],[1,4],[2,5],[2,6]]).toDF([u"L\xe0",u"Here"])
> >>> df.show()
> +---+----+
> | Là|Here|
> +---+----+
> |  1|   2|
> |  1|   4|
> |  2|   5|
> |  2|   6|
> +---+----+
> >>> df.columns
> ['L\xc3\xa0', 'Here']
> >>> df.select(u'L\xc3\xa0').show()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 992, in select
>     jdf = self._jdf.select(self._jcols(*cols))
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
>   File "F:\DataScience\spark-2.2.0-SNAPSHOT-bin-hadoop2.7\python\pyspark\sql\utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"cannot resolve '`L\xc3\xa0`' given input columns: [L\xe0, Here];;\n'Project ['L\xc3\xa0]\n+- LogicalRDD [L\xe0#14L, Here#15L]\n"
> >>> df.select(u'L\xe0').show()
> +---+
> | Là|
> +---+
> |  1|
> |  1|
> |  2|
> |  2|
> +---+
> >>> df.select(u'L\xe0').collect()[0].asDict()
> {'L\xc3\xa0': 1}
> This does not seem to affect the Scala version:
> scala> val df = sc.parallelize(Seq((1,2),(1,4),(2,5),(2,6))).toDF("L\u00e0","Here")
> df: org.apache.spark.sql.DataFrame = [Là: int, Here: int]
> scala> df.select("L\u00e0").show()
> [...output elided...]
> +---+
> | Là|
> +---+
> |  1|
> |  1|
> |  2|
> |  2|
> +---+
> scala> df.columns(0).map(c => c.toInt )
> res8: scala.collection.immutable.IndexedSeq[Int] = Vector(76, 224)
> [Note that 224 is \u00e0, i.e., the original value]
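For background (not part of the original report): the mangled name seen in `df.columns` is consistent with plain UTF-8 encoding of the original character, since the bytes `\xc3\xa0` are exactly the UTF-8 encoding of U+00E0 (à). A minimal sketch of that round-trip, in Python 3 syntax with no Spark dependency (the attribution to Python 2 byte-string handling is an inference, not confirmed in the ticket):

```python
# The bytes \xc3\xa0 reported in df.columns are the UTF-8 encoding of
# U+00E0 ('à'), the character used when the column was created.
original = u"L\xe0"                 # the column name as created: u'Là'
encoded = original.encode("utf-8")  # the byte string the report shows
assert encoded == b"L\xc3\xa0"

# Feeding those bytes back to select() fails in the report because the
# analyzer compares them against the original unicode name; byte-for-byte
# they are a different string:
assert encoded.decode("latin-1") != original  # 'LÃ\xa0' vs 'Là'
```

This would explain why `select(u'L\xe0')` succeeds (it matches the stored unicode name) while `select(u'L\xc3\xa0')` fails, and why `asDict()` surfaces the UTF-8 byte form again.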



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org