Posted to issues@spark.apache.org by "xuqianjin (JIRA)" <ji...@apache.org> on 2018/11/27 02:00:00 UTC

[jira] [Comment Edited] (SPARK-23410) Unable to read jsons in charset different from UTF-8

    [ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699803#comment-16699803 ] 

xuqianjin edited comment on SPARK-23410 at 11/27/18 1:59 AM:
-------------------------------------------------------------

Hi [~maxgekk] [~hyukjin.kwon], I think there are two things to consider:
1. Even if lineSep is set, it is still necessary to identify the file's charset from its BOM. The charset of lineSep may be inconsistent with the encoding of the file, resulting in parsing errors.
2. For example, a comma is encoded as different bytes in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These encodings are also supported for lineSep.
In my opinion, we can try to read the first four bytes of the file on the executor side to identify the encoding of the file, because once the charset of the file is determined, the charset of lineSep is also determined.
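
To make the idea concrete, here is a minimal sketch in plain Python of identifying the charset from the first four bytes and then encoding the separator in that charset. It is only an illustration, not Spark code: the names detect_bom_charset and encode_separator are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption.

```python
import codecs

# 4-byte BOMs must be tested before the 2-byte BOMs they start with:
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom_charset(head, default="utf-8"):
    """Return (charset, bom_length) for the first bytes of a file.

    Hypothetical helper. `default` is an assumed fallback used when
    the bytes do not start with any known BOM.
    """
    for bom, charset in BOMS:
        if head.startswith(bom):
            return charset, len(bom)
    return default, 0


def encode_separator(sep, charset):
    """Encode a separator (lineSep, comma, ...) in the file's charset."""
    return sep.encode(charset)


# First four bytes of a UTF-16LE file that starts with a BOM and "{".
charset, bom_len = detect_bom_charset(b"\xff\xfe{\x00")
line_sep = encode_separator("\n", charset)
comma = encode_separator(",", charset)
```

With this input, charset is "utf-16-le", so the newline and the comma each become two bytes; the same characters would be one byte in UTF-8 and four bytes in UTF-32. That is why a lineSep given in a different charset than the file would not match.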


was (Author: x1q1j1):
Hi [~maxgekk] [~hyukjin.kwon], I think there are two things to consider:
1. Even if lineSep is set, it is still necessary to identify the file's charset from its BOM. The charset of lineSep may be inconsistent with the encoding of the file, resulting in parsing errors.
2. For example, a comma is encoded as different bytes in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. These encodings are also supported for lineSep.

> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
>                 Key: SPARK-23410
>                 URL: https://issues.apache.org/jira/browse/SPARK-23410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This behavior breaks backward compatibility with Spark 2.2.1 and earlier versions, which could read JSON files in UTF-16, UTF-32 and other encodings thanks to the auto-detection mechanism of the Jackson library. Users need to be given back the ability to read JSON files in a specified charset and/or to detect the charset automatically, as before.
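
The difference between a forced UTF-8 read and automatic detection can be shown with a small sketch in plain Python. This is only an analogy using the Python standard library, not Spark or Jackson code; the function names parse_forced_utf8 and parse_autodetect are hypothetical.

```python
import json

record = {"name": "Spark"}
# The "utf-16" codec prepends a BOM, similar to a UTF-16 file with a BOM.
raw = json.dumps(record).encode("utf-16")


def parse_forced_utf8(data):
    """Decode as UTF-8 regardless of the actual encoding of the bytes."""
    return json.loads(data.decode("utf-8"))


def parse_autodetect(data):
    """Pass bytes directly; json.loads detects UTF-8, UTF-16 or UTF-32."""
    return json.loads(data)


try:
    parse_forced_utf8(raw)
    forced_ok = True
except UnicodeDecodeError:
    # The BOM bytes are not valid UTF-8, so the forced read fails.
    forced_ok = False

detected = parse_autodetect(raw)
```

Here the forced UTF-8 path fails on the BOM bytes, while the auto-detecting path returns the original record.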


