Posted to user@spark.apache.org by Mark Bittmann <mb...@gmail.com> on 2016/09/13 13:18:59 UTC

Character encoding corruption in Spark JDBC connector

Hello Spark community,

I'm reading from a MySQL database into a Spark dataframe using the JDBC
connector functionality, and I'm experiencing some character encoding
issues. The default encoding for MySQL strings is latin1, but the MySQL
JDBC connector's implementation of "ResultSet.getString()" returns an
incorrectly encoded string for certain characters, such as the "all rights
reserved" character. Calling "new String(ResultSet.getBytes())" instead
returns the correctly encoded string. I've confirmed this behavior with
the MySQL connector classes directly (i.e., without the Spark wrapper).
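
For reference, here is roughly how I verified the difference against the
raw connector (the connection details, table, and column names below are
just placeholders):

    import java.sql.DriverManager
    import java.nio.charset.StandardCharsets

    // Placeholder connection details -- not my real setup.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/mydb", "user", "password")
    val rs = conn.createStatement().executeQuery(
      "SELECT some_text_column FROM some_table LIMIT 10")
    while (rs.next()) {
      val viaGetString = rs.getString(1)
      val viaGetBytes  = new String(rs.getBytes(1), StandardCharsets.UTF_8)
      // viaGetString comes back mangled for some characters;
      // viaGetBytes matches what is actually stored in the table.
      println(s"getString: $viaGetString | getBytes: $viaGetBytes")
    }
    conn.close()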

I can see here that the Spark JDBC connector uses getString(), though there
is a note to move to getBytes() for performance reasons:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389
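
To be clear about what I mean, here is a standalone sketch of a
getBytes()-based conversion (this is not the actual JdbcUtils code, just
the core idea):

    import java.sql.ResultSet
    import java.nio.charset.StandardCharsets

    // Sketch only: read the raw bytes and decode them explicitly,
    // with a null check since getBytes returns null for SQL NULL.
    def getStringViaBytes(rs: ResultSet, pos: Int): String = {
      val bytes = rs.getBytes(pos)
      if (bytes == null) null else new String(bytes, StandardCharsets.UTF_8)
    }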

For some special characters, I can reverse the damage with a UDF that
applies new String(badString.getBytes("Cp1252"), "UTF-8"); however, for
some languages the underlying byte array is irreversibly changed and the
data is corrupted.
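
For what it's worth, that partial workaround looks roughly like this on
1.6 ("text_col" is just an example column name, and df is the dataframe
loaded through the JDBC connector):

    import org.apache.spark.sql.functions.{col, udf}

    // Take the mis-decoded string, get its bytes back via Cp1252,
    // and re-decode them as UTF-8. This recovers some characters,
    // but for some languages the original bytes are already lost.
    val reencode = udf { (s: String) =>
      if (s == null) null else new String(s.getBytes("Cp1252"), "UTF-8")
    }

    val fixed = df.withColumn("text_col", reencode(col("text_col")))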

I can submit an issue/PR to fix it going forward if "new
String(ResultSet.getBytes())" is the correct approach.

Meanwhile, can anyone recommend a way to correct this behavior before the
data reaches the dataframe? I've tried every permutation of the settings
in the JDBC connection URL (characterSetResults, characterEncoding).
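
For example, one of the permutations I tried (host, database, and table
names are placeholders):

    // Spark 1.6 JDBC read with the encoding parameters set in the URL.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb" +
        "?characterEncoding=UTF-8&characterSetResults=UTF-8")
      .option("dbtable", "some_table")
      .option("user", "user")
      .option("password", "password")
      .load()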

I'm on Spark 1.6.

Thanks!