Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:23:23 UTC
[jira] [Updated] (SPARK-18571) pyspark: UTF-8 not written correctly (as CSV) when locale is not UTF-8
[ https://issues.apache.org/jira/browse/SPARK-18571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-18571:
---------------------------------
Labels: bulk-closed (was: )
> pyspark: UTF-8 not written correctly (as CSV) when locale is not UTF-8
> ----------------------------------------------------------------------
>
> Key: SPARK-18571
> URL: https://issues.apache.org/jira/browse/SPARK-18571
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.0.2
> Reporter: Adrian Bridgett
> Priority: Major
> Labels: bulk-closed
> Attachments: unicode.py
>
>
> Sample code is attached; it was run with Hadoop 2.7.3 and Python 3.5.
> If I run this with --master='local[*]' and LANG=en_US.UTF-8, and then cat the file in _another_ terminal (which also has LANG=en_US.UTF-8 set), I see the Pi character I expect.
> Back in the first terminal, if I set LANG=C (or unset it) and rerun, the output viewed in the other terminal (still set to en_US.UTF-8) is corrupted.
> I originally noticed this because the data is corrupted when I run the job with our normal Mesos scheduler (those boxes do have LANG=en_US.UTF-8, but perhaps it is not being picked up).
> I don't remember needing to set LANG like this on Spark 1.6.1 (Hadoop 2.7.1).
> Expected characters: 0x80cf
> Received: 0xbfef efbd bdbf
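
The two hex values quoted above can be checked with a short sketch. It assumes they were copied from hexdump-style output that prints 16-bit little-endian words, so the bytes within each word appear swapped; the report does not say which tool produced them. The helper name words_to_bytes is hypothetical, and the "decode as ASCII, re-encode as UTF-8" step is only one failure mode consistent with the bytes, not a confirmed diagnosis of Spark's behaviour.

```python
def words_to_bytes(dump: str) -> bytes:
    """Turn hexdump-style 16-bit little-endian words into the raw byte sequence."""
    raw = bytearray()
    for word in dump.replace("0x", "").split():
        raw += int(word, 16).to_bytes(2, "little")
    return bytes(raw)


expected = words_to_bytes("0x80cf")            # value quoted in the report
received = words_to_bytes("0xbfef efbd bdbf")  # value quoted in the report

# Assumed failure mode (not confirmed in the report): the UTF-8 bytes are
# decoded with an ASCII charset, and the result is then written out as UTF-8.
simulated = expected.decode("ascii", errors="replace").encode("utf-8")

# ascii() keeps the output printable even when stdout is not UTF-8.
print(ascii(expected.decode("utf-8")))  # '\u03c0' (Greek small letter pi)
print(ascii(received.decode("utf-8")))  # '\ufffd\ufffd' (two replacement characters)
print(simulated == received)            # True
```

Under that reading, the expected bytes are cf 80 (the UTF-8 encoding of U+03C0) and the received bytes are ef bf bd ef bf bd (two U+FFFD replacement characters), which is what results when each non-ASCII byte is replaced individually during an ASCII decode.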