Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/07/12 09:47:00 UTC

[jira] [Assigned] (SPARK-36089) Update the SQL migration guide about encoding auto-detection of CSV files

     [ https://issues.apache.org/jira/browse/SPARK-36089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36089:
------------------------------------

    Assignee: Max Gekk  (was: Apache Spark)

> Update the SQL migration guide about encoding auto-detection of CSV files 
> --------------------------------------------------------------------------
>
>                 Key: SPARK-36089
>                 URL: https://issues.apache.org/jira/browse/SPARK-36089
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 3.2.0, 3.1.3, 3.0.4
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>
> The SQL migration guide needs to be updated to inform users about a behavior change.
> *What*: Spark doesn't correctly detect the encoding (charset) of CSV files that have a BOM. Such files can be read only in multiLine mode, and only when the CSV option encoding matches the actual encoding of the files. For example, Spark cannot read UTF-16BE CSV files when encoding is set to UTF-8, which is the default. This is the case in the current ES ticket.
> *Why*: In previous Spark versions, the encoding wasn't propagated to the underlying library, which meant the library tried to detect the file encoding automatically. Detection could succeed for some encodings that require a BOM at the beginning of the file. Starting from version 3.0, users can specify the file encoding via the CSV option encoding, which has UTF-8 as its default value. Spark propagates this default to the underlying library (uniVocity), and as a consequence encoding auto-detection is turned off.
> *When*: Since Spark 3.0. In particular, the commit https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271 causes the issue.
> *Workaround*: Enable the encoding auto-detection mechanism in uniVocity by passing null as the value of the CSV option encoding. The more recommended approach is to set the encoding option to UTF-16 explicitly.
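
The recommended workaround could be sketched roughly as follows, assuming PySpark and an existing SparkSession passed in as `spark`. The helper names (`has_utf16_bom`, `choose_csv_encoding`, `read_csv_with_explicit_encoding`) are hypothetical and not part of Spark; only the `encoding` and `multiLine` CSV options come from the ticket. The BOM check is an illustration of how a caller might decide when to set UTF-16, not something Spark does itself.

```python
import codecs


def has_utf16_bom(path):
    """Return True if the file starts with a UTF-16 byte order mark."""
    with open(path, "rb") as f:
        head = f.read(4)
    # FF FE 00 00 is the UTF-32LE BOM; it shares its first two bytes
    # with the UTF-16LE BOM, so rule it out first.
    if head.startswith(codecs.BOM_UTF32_LE):
        return False
    return head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))


def choose_csv_encoding(path, default="UTF-8"):
    """Pick the value for the CSV option `encoding` (hypothetical helper)."""
    return "UTF-16" if has_utf16_bom(path) else default


def read_csv_with_explicit_encoding(spark, path):
    """Read a CSV file with the encoding set explicitly.

    Assumption: `spark` is an existing SparkSession. multiLine is enabled
    because, per the ticket, files with a BOM can be read only in that mode.
    """
    return (
        spark.read
        .option("multiLine", True)
        .option("encoding", choose_csv_encoding(path))
        .csv(path)
    )
```

Setting the option explicitly keeps the behavior independent of whether the underlying parser's auto-detection is enabled, which is why the ticket prefers it over passing null.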



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
