Posted to commits@spark.apache.org by gu...@apache.org on 2021/07/12 09:55:46 UTC

[spark] branch branch-3.2 updated: [SPARK-36089][SQL][DOCS] Update the SQL migration guide about encoding auto-detection of CSV files

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 4c7ac5f  [SPARK-36089][SQL][DOCS] Update the SQL migration guide about encoding auto-detection of CSV files
4c7ac5f is described below

commit 4c7ac5fc90ed48b933b3ca2737da64b971682611
Author: Max Gekk <ma...@gmail.com>
AuthorDate: Mon Jul 12 18:54:39 2021 +0900

    [SPARK-36089][SQL][DOCS] Update the SQL migration guide about encoding auto-detection of CSV files
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to update the SQL migration guide, in particular the section about the migration from Spark 2.4 to 3.0. The new item informs users about the following issue:
    
    **What**: Spark doesn't detect the encoding (charset) of CSV files with a BOM correctly. Such files can be read only in the multiLine mode and when the CSV option `encoding` matches the actual encoding of the files. For example, Spark cannot read UTF-16BE CSV files when `encoding` is set to UTF-8, which is the default. This is the scenario behind the current ES ticket.
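    A minimal sketch (not part of this patch) of how the issue shows up; the file path is hypothetical and `spark` is assumed to be an active `SparkSession` (e.g. in spark-shell):

    ```scala
    // A UTF-16BE CSV file written with a BOM. Since Spark 3.0 it is decoded
    // with the default encoding UTF-8, so the content comes back garbled.
    val df = spark.read
      .option("multiLine", true)   // BOM-based auto-detection only ever applied in multiLine mode
      .option("header", true)
      .csv("/tmp/utf16be.csv")     // hypothetical path
    df.show()
    ```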
    
    **Why**: In previous Spark versions, the encoding wasn't propagated to the underlying library, which means the library tried to detect the file encoding automatically. It could succeed for encodings that place a BOM at the beginning of files. Starting from version 3.0, users can specify the file encoding via the CSV option `encoding`, which has UTF-8 as the default value. Spark propagates this default to the underlying library (uniVocity), and as a consequence this turned off encoding auto-detection.
    
    **When**: Since Spark 3.0. In particular, the commit https://github.com/apache/spark/commit/2df34db586bec379e40b5cf30021f5b7a2d79271 causes the issue.
    
    **Workaround**: Enable the encoding auto-detection mechanism in uniVocity by passing `null` as the value of the CSV option `encoding`. A more recommended approach is to set the `encoding` option explicitly. Both are sketched below.
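    A minimal sketch of both workarounds, again with a hypothetical input file:

    ```scala
    // Preferred: state the actual encoding of the file explicitly.
    val explicitEncoding = spark.read
      .option("encoding", "UTF-16BE")
      .option("multiLine", true)
      .csv("/tmp/utf16be.csv")

    // Alternative: pass null for `encoding` to fall back to uniVocity's
    // BOM-based auto-detection, as in Spark 2.4 and below.
    val autoDetected = spark.read
      .option("encoding", null)
      .option("multiLine", true)
      .csv("/tmp/utf16be.csv")
    ```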
    
    ### Why are the changes needed?
    To improve user experience with Spark SQL. This should help users in their migration from Spark 2.4.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Should be checked by building the docs in GA/Jenkins.
    
    Closes #33300 from MaxGekk/csv-encoding-migration-guide.
    
    Authored-by: Max Gekk <ma...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit e788a3fa887951f68fc0b690cad24e48efb9c1a8)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index d9c48d3..28e1cd2 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -326,6 +326,8 @@ license: |
 
   - In Spark 3.0, when Avro files are written with user provided non-nullable schema, even the catalyst schema is nullable, Spark is still able to write the files. However, Spark throws runtime NullPointerException if any of the records contains null.
 
+  - In Spark version 2.4 and below, CSV datasource can detect encoding of input files automatically when the files have BOM at the beginning. For instance, CSV datasource can recognize UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE in the multi-line mode (the CSV option `multiLine` is set to `true`). In Spark 3.0, CSV datasource reads input files in encoding specified via the CSV option `encoding` which has the default value of UTF-8. In this way, if file encoding doesn't match to the  [...]
+
 ### Others
 
   - In Spark version 2.4, when a Spark session is created via `cloneSession()`, the newly created Spark session inherits its configuration from its parent `SparkContext` even though the same configuration may exist with a different value in its parent Spark session. In Spark 3.0, the configurations of a parent `SparkSession` have a higher precedence over the parent `SparkContext`. You can restore the old behavior by setting `spark.sql.legacy.sessionInitWithConfigDefaults` to `true`.
