Posted to issues@spark.apache.org by "Himanshu Arora (Jira)" <ji...@apache.org> on 2022/04/06 07:37:00 UTC

[jira] [Updated] (SPARK-38801) ISO-8859-1 encoding doesn't work for text format

     [ https://issues.apache.org/jira/browse/SPARK-38801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Himanshu Arora updated SPARK-38801:
-----------------------------------
    Attachment: Screenshot 2022-04-06 at 09.29.24.png

> ISO-8859-1 encoding doesn't work for text format
> ------------------------------------------------
>
>                 Key: SPARK-38801
>                 URL: https://issues.apache.org/jira/browse/SPARK-38801
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.1
>         Environment: I tested this issue on Databricks Runtime 10.3 (Spark 3.2.1, Scala 2.12)
>            Reporter: Himanshu Arora
>            Priority: Major
>         Attachments: Screenshot 2022-04-06 at 09.29.24.png, Screenshot 2022-04-06 at 09.30.02.png
>
>
> When Spark reads text files that are not UTF-8 encoded, non-ASCII characters (for example, French characters such as è and é) are not decoded correctly: each one is replaced by �. In my case the text files were encoded in ISO-8859-1.
> After digging into the docs, it seems that Spark still uses Hadoop's LineRecordReader class for the text format, which only supports UTF-8. Here is the source code of that class: [LineRecordReader.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L154]
>  
> You can see this issue in the screenshot below:
> !image-2022-04-06-09-30-21-751.png!
> As you can see, the French word *données* is read as *donn�es*, and the word *Clôturé* is read as *Cl�tur�*.
>  
> I also read the same file using the CSV format with the correct charset value, and it works fine in that case, as you can see in the screenshot below:
> !image-2022-04-06-09-31-45-062.png!
>  
> So the issue is specific to the text format, which is why I am reporting it.
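
The symptom described above can be reproduced without Spark. The following is a minimal sketch in plain Python, assuming only that the file bytes are ISO-8859-1 and that the reader decodes them as UTF-8 with invalid sequences replaced by U+FFFD (�). The helper name decode_line is hypothetical and is not part of Spark or Hadoop.

```python
# Illustrative only: plain Python, no Spark or Hadoop involved.
# decode_line is a hypothetical helper, not a Spark or Hadoop API.

def decode_line(raw: bytes, charset: str) -> str:
    """Decode one line of raw bytes, substituting U+FFFD for invalid sequences."""
    return raw.decode(charset, errors="replace")

# The two words from the report, as they would be stored in an ISO-8859-1 file.
raw_lines = [word.encode("iso-8859-1") for word in ("données", "Clôturé")]

# Decoding as UTF-8: the bytes 0xE9 (é) and 0xF4 (ô) are not valid UTF-8
# on their own, so each one becomes the replacement character.
as_utf8 = [decode_line(raw, "utf-8") for raw in raw_lines]

# Decoding the same bytes with the charset they were written in.
as_latin1 = [decode_line(raw, "iso-8859-1") for raw in raw_lines]

print(as_utf8)
print(as_latin1)
```

Because ISO-8859-1 uses the same byte values as UTF-8 for line feed and carriage return, splitting such a file into lines on raw bytes is still safe; only the final bytes-to-string step needs the correct charset.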



