Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2019/09/18 22:53:00 UTC
[jira] [Updated] (SPARK-29160) Event log file is written without specific charset which should be ideally UTF-8

     [ https://issues.apache.org/jira/browse/SPARK-29160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-29160:
---------------------------------
    Description: 
This issue comes from an observation by [~vanzin]: [https://github.com/apache/spark/pull/25670#discussion_r325383512]

Quoting his comment here:

{quote}
This is a long standing bug in the original code, but this should be explicitly setting the charset to UTF-8 (using new PrintWriter(new OutputStreamWriter(...)).

The reader side should too, although doing that now could potentially break old logs... we should open a bug for this.
{quote}

While EventLoggingListener correctly encodes to UTF-8 when it converts to byte[] before writing, it does not specify a charset in logEvent().
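
To illustrate the writer-side suggestion, here is a minimal Java sketch of wrapping an output stream so that the charset is always UTF-8. The class and method names (Utf8EventWriter, utf8Writer, encodeEvent) are hypothetical and are not taken from the Spark code base; only the PrintWriter/OutputStreamWriter wrapping comes from the quoted comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Spark.
class Utf8EventWriter {

    // Wraps a stream in a PrintWriter that always encodes as UTF-8.
    // new PrintWriter(out) on its own would fall back to the JVM default charset.
    static PrintWriter utf8Writer(OutputStream out) {
        return new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Writes one event line into an in-memory buffer and returns the raw bytes.
    static byte[] encodeEvent(String eventJson) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintWriter writer = utf8Writer(buffer)) {
            writer.println(eventJson);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String event = "{\"name\":\"caf\u00e9\"}";
        byte[] bytes = encodeEvent(event);
        // Decoding with the same charset recovers the original line.
        String decoded = new String(bytes, StandardCharsets.UTF_8).trim();
        System.out.println(event.equals(decoded));
    }
}
```

With the charset fixed in one place, the bytes written no longer depend on the file.encoding setting of the JVM that produced the log.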

It should be fixed, but as Marcelo said, we also need to be aware that the fix could break reading of old logs.
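
The compatibility concern can be sketched as follows. This is a hedged Java illustration, not Spark code: the class Utf8EventReader and the method readFirstLine are hypothetical, and ISO-8859-1 is only an assumed example of a non-UTF-8 default charset that an old log might have been written with.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of Spark.
class Utf8EventReader {

    // Reads the first line of a stream, decoding with an explicitly chosen charset.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String event = "{\"name\":\"caf\u00e9\"}";
        // A log written with an explicit UTF-8 writer.
        byte[] newLog = event.getBytes(StandardCharsets.UTF_8);
        // Assumed example of an old log written under a non-UTF-8 default charset.
        byte[] oldLog = event.getBytes(StandardCharsets.ISO_8859_1);

        String fromNew = readFirstLine(new ByteArrayInputStream(newLog), StandardCharsets.UTF_8);
        String fromOld = readFirstLine(new ByteArrayInputStream(oldLog), StandardCharsets.UTF_8);

        System.out.println(event.equals(fromNew)); // true
        System.out.println(event.equals(fromOld)); // false: the non-ASCII byte is not valid UTF-8
    }
}
```

Logs that contain only ASCII characters decode identically under both charsets, so only old logs with non-ASCII content written under a non-UTF-8 default would be affected.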

  was:
This issue comes from an observation by [~vanzin]: [https://github.com/apache/spark/pull/25670#discussion_r325383512]

Quoting his comment here:
{noformat}
This is a long standing bug in the original code, but this should be explicitly setting the charset to UTF-8 (using new PrintWriter(new OutputStreamWriter(...)).

The reader side should too, although doing that now could potentially break old logs... we should open a bug for this.{noformat}
While EventLoggingListener correctly encodes to UTF-8 when it converts to byte[] before writing, it does not specify a charset in logEvent().

It should be fixed, but as Marcelo said, we also need to be aware that the fix could break reading of old logs.


> Event log file is written without specific charset which should be ideally UTF-8
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-29160
>                 URL: https://issues.apache.org/jira/browse/SPARK-29160
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> This issue comes from an observation by [~vanzin]: [https://github.com/apache/spark/pull/25670#discussion_r325383512]
> Quoting his comment here:
> {quote}
> This is a long standing bug in the original code, but this should be explicitly setting the charset to UTF-8 (using new PrintWriter(new OutputStreamWriter(...)).
> The reader side should too, although doing that now could potentially break old logs... we should open a bug for this.
> {quote}
> While EventLoggingListener correctly encodes to UTF-8 when it converts to byte[] before writing, it does not specify a charset in logEvent().
> It should be fixed, but as Marcelo said, we also need to be aware that the fix could break reading of old logs.


