Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/13 15:16:54 UTC

[GitHub] [spark] wangyum opened a new pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

wangyum opened a new pull request #29088:
URL: https://github.com/apache/spark/pull/29088


   ### What changes were proposed in this pull request?
   
   This PR adds a new `bom` option to `CSVOptions` so that characters are no longer garbled when the written CSV file is opened with Excel.
   
   Before this PR:
   ![image](https://user-images.githubusercontent.com/5399861/87321221-8a882f80-c55e-11ea-85fd-0a26cacbdaf8.png)
   
   
   After this PR, with `bom` set to true:
   ![image](https://user-images.githubusercontent.com/5399861/87321310-abe91b80-c55e-11ea-8b12-3b510385a1c0.png)
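As a rough illustration of the idea only (not the code in this PR), the sketch below writes rows as UTF-8 CSV and optionally emits U+FEFF once before the first record. `BomCsvSketch`, `writeCsv` and `startsWithUtf8Bom` are hypothetical names made up for this example, and the CSV formatting is deliberately naive (no quoting or escaping).

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets

object BomCsvSketch {
  // Hypothetical helper, not Spark's CsvOutputWriter: serializes rows as UTF-8 CSV,
  // optionally emitting U+FEFF once before the first record.
  def writeCsv(rows: Seq[Seq[String]], bom: Boolean): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)
    if (bom) writer.write(0xFEFF) // encoded by the UTF-8 writer as EF BB BF
    rows.foreach(row => writer.write(row.mkString(",") + "\n"))
    writer.close()
    out.toByteArray
  }

  // True when the byte array begins with the UTF-8 BOM (EF BB BF).
  def startsWithUtf8Bom(bytes: Array[Byte]): Boolean =
    bytes.take(3).sameElements(Array(0xEF.toByte, 0xBB.toByte, 0xBF.toByte))
}
```

Calling `BomCsvSketch.writeCsv(rows, bom = true)` yields bytes that begin with the UTF-8 BOM, which is the marker Excel is assumed to rely on here to detect the encoding.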
   
   
   ### Why are the changes needed?
   
   Some characters are garbled when a CSV file written by Spark is opened with Excel.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   Unit test.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657622318








[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657904952








[GitHub] [spark] SparkQA removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657904577


   **[Test build #125794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125794/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454812360



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2354,43 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Support write BOM to file before writing data if encoded by UTF-8 charset") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
+    val japanese = "私は日本人が好き"

Review comment:
       I guess Japanese is the same case @ueshin or @maropu?






[GitHub] [spark] HyukjinKwon commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657994072


   retest this please




[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r453726510



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       More details: https://stackoverflow.com/questions/32072017/write-utf-8-bom-with-supercsv






[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657996178








[GitHub] [spark] SparkQA removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657625853


   **[Test build #125776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125776/testReport)** for PR 29088 at commit [`bfab2a5`](https://github.com/apache/spark/commit/bfab2a5be81542ba653b0b45085b1e5aeaa3a1e1).




[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657625853


   **[Test build #125776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125776/testReport)** for PR 29088 at commit [`bfab2a5`](https://github.com/apache/spark/commit/bfab2a5be81542ba653b0b45085b1e5aeaa3a1e1).




[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454249166



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       No. `0xEFBBBF` will change the value:
   ![image](https://user-images.githubusercontent.com/5399861/87413662-2b2d2c80-c5fd-11ea-81b9-d363247f02e6.png)
   






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454170678



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"

Review comment:
       Yup!






[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657974464








[GitHub] [spark] SparkQA removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657995711


   **[Test build #125807 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125807/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).




[GitHub] [spark] stczwd commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
stczwd commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454791986



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       Excel. It will change the actual value if we add `0xFEFF` in the front.






[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658160091


   **[Test build #125833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125833/testReport)** for PR 29088 at commit [`9dde8c2`](https://github.com/apache/spark/commit/9dde8c277dfb7d4925cd4981f7c3183c51f4af8e).




[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658009294


   **[Test build #125807 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125807/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] ueshin commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
ueshin commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454835255



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2354,43 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Support write BOM to file before writing data if encoded by UTF-8 charset") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
+    val japanese = "私は日本人が好き"

Review comment:
       If you mean Japanese as a language, you should use "私は日本語が好き". The current one means "I like Japanese people".






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657974472


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125794/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658009412


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125807/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657995711


   **[Test build #125807 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125807/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).




[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658160583








[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454764920



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       @stczwd What tool will change the value if we use `0xFEFF`?






[GitHub] [spark] wangyum commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-674633117


   Thank you all. There is a workaround:
   
   ![image](https://user-images.githubusercontent.com/5399861/90353884-22a39800-e07a-11ea-8a7b-37ddb1689fd6.png)
   




[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658325369








[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657904577


   **[Test build #125794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125794/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454129942



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"

Review comment:
       LOL






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454134455



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       Hm, `0xFEFF` is the BOM for UTF-16 Big Endian, see https://en.wikipedia.org/wiki/Byte_order_mark. Does it work if you specify `0xEFBBBF` instead?
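For context on the question above, here is a minimal sketch (plain `java.io`, independent of Spark) of what a UTF-8 `Writer` emits for each value. It assumes the writer in the diff behaves like a standard `java.io.Writer`, whose `write(int)` uses only the low 16 bits of its argument; `BomBytes`, `encode` and `hex` are names invented for this example.

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter}
import java.nio.charset.{Charset, StandardCharsets}

object BomBytes {
  // Write one char value through a Writer using the given charset and return the raw bytes.
  def encode(value: Int, charset: Charset): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new OutputStreamWriter(out, charset)
    writer.write(value) // Writer.write(int) ignores the 16 high-order bits
    writer.flush()
    out.toByteArray
  }

  // Render bytes as space-separated upper-case hex.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString(" ")

  def main(args: Array[String]): Unit = {
    // U+FEFF encoded as UTF-8 is the three-byte UTF-8 BOM.
    println(hex(encode(0xFEFF, StandardCharsets.UTF_8))) // prints "EF BB BF"
    // 0xEFBBBF is truncated to 0xBBBF, an ordinary Hangul syllable, not a BOM.
    println(hex(encode(0xEFBBBF, StandardCharsets.UTF_8)))
  }
}
```

Under that assumption, `0xFEFF` is a code point here rather than a byte sequence: the UTF-8 encoder turns it into `EF BB BF`, while passing `0xEFBBBF` writes a visible character at the start of the file.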






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657974464


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657694129


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657622318








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657904952








[GitHub] [spark] wangyum closed pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #29088:
URL: https://github.com/apache/spark/pull/29088


   




[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658009403








[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657996178








[GitHub] [spark] srowen commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454436084



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       Yes, this is the wrong BOM. It may help your case but is going to break others - even adding the BOM might break things. I think you should tell Excel that you need _UTF-16_ when reading the file. That's what you're 'fixing'






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657694140


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125776/
   Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657693900


   **[Test build #125776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125776/testReport)** for PR 29088 at commit [`bfab2a5`](https://github.com/apache/spark/commit/bfab2a5be81542ba653b0b45085b1e5aeaa3a1e1).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658009403


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658160583








[GitHub] [spark] srowen commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r471141501



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       This change is still wrong. I cannot see merging this.






[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658325369








[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454812272



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2354,43 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Support write BOM to file before writing data if encoded by UTF-8 charset") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"

Review comment:
       Oh, @wangyum, BTW, did you mean "I like Korean" with Korean as a language? If so, I think you should write "나는 한국어를 좋아한다". The current text reads more like "I like Korean people".






[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454328241



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
+    val japanese = "私は日本人が好き"
+    // scalastyle:on nonascii
+    val english = "I love English"
+
+    val df = spark.sql(s"SELECT '$chinese' AS Chinese, '$korean' AS Korean," +
+      s"'$japanese' AS Japanese, '$english' AS English")
+
+    Seq(true, false).foreach { bom =>
+      withTempPath { p =>
+        val path = p.getAbsolutePath
+        df.write.option("bom", bom).csv(path)
+
+        val bytesReads = new mutable.ArrayBuffer[Long]()
+        val bytesReadListener = new SparkListener() {
+          override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
+            bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
+          }
+        }
+        sparkContext.addSparkListener(bytesReadListener)

Review comment:
       +1






[GitHub] [spark] SparkQA removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658160091


   **[Test build #125833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125833/testReport)** for PR 29088 at commit [`9dde8c2`](https://github.com/apache/spark/commit/9dde8c277dfb7d4925cd4981f7c3183c51f4af8e).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454135608



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
+    val japanese = "私は日本人が好き"
+    // scalastyle:on nonascii
+    val english = "I love English"
+
+    val df = spark.sql(s"SELECT '$chinese' AS Chinese, '$korean' AS Korean," +
+      s"'$japanese' AS Japanese, '$english' AS English")
+
+    Seq(true, false).foreach { bom =>
+      withTempPath { p =>
+        val path = p.getAbsolutePath
+        df.write.option("bom", bom).csv(path)
+
+        val bytesReads = new mutable.ArrayBuffer[Long]()
+        val bytesReadListener = new SparkListener() {
+          override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
+            bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
+          }
+        }
+        sparkContext.addSparkListener(bytesReadListener)

Review comment:
       @wangyum, I think you can have two tests: one that checks the round trip, and another that checks the size in bytes, for example by using the `binaryFile` source.
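
The byte-level half of that suggestion could look roughly like the sketch below. It deliberately avoids Spark and reads the file with plain `java.nio` rather than the `binaryFile` source; `BomCheck`, `startsWithUtf8Bom` and `writeCsv` are hypothetical names, and `writeCsv` is only a stand-in for the CSV writer under review, not its actual behaviour.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object BomCheck {
  // The three bytes that U+FEFF encodes to in UTF-8.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // True when the file starts with the UTF-8 BOM bytes.
  def startsWithUtf8Bom(file: Path): Boolean =
    Files.readAllBytes(file).take(3).sameElements(Utf8Bom)

  // Stand-in writer: optionally prefixes the content with U+FEFF.
  def writeCsv(file: Path, content: String, bom: Boolean): Unit = {
    val text = if (bom) "\uFEFF" + content else content
    Files.write(file, text.getBytes(StandardCharsets.UTF_8))
    ()
  }
}
```

A size check then reduces to comparing the two files: for UTF-8 output, the file written with the option should be exactly three bytes longer than the one written without it.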






[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454328505



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -135,6 +135,8 @@ class CSVOptions(
   val positiveInf = parameters.getOrElse("positiveInf", "Inf")
   val negativeInf = parameters.getOrElse("negativeInf", "-Inf")
 
+  // Set bom to true to fix some characters are garbled when opening with Excel.
+  val bom = getBool("bom")

Review comment:
       It seems `writeBOM` only supports UTF-8.






[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657694129








[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
wangyum commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454167871



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
##########
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"

Review comment:
       Is it correct? I'm not sure.






[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454139593



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala
##########
@@ -135,6 +135,8 @@ class CSVOptions(
   val positiveInf = parameters.getOrElse("positiveInf", "Inf")
   val negativeInf = parameters.getOrElse("negativeInf", "-Inf")
 
+  // Set bom to true to fix some characters are garbled when opening with Excel.
+  val bom = getBool("bom")

Review comment:
       I think you should add an assert here that the encoding is always UTF-X, such as UTF-8, UTF-16, UTF-16LE or UTF-16BE. I would also name it `writeBOM`.
   You will have to document this as an option in `DataFrameWriter`, `DataStreamWriter`, `readwrite.py` and `streaming.py`.
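
A check of the kind requested here might be sketched as follows. This is not the code from the PR: `BomOptionCheck`, `supportsBom` and `validate` are hypothetical names, and the rule "canonical charset name starts with `UTF-`" is an assumption about which encodings should be accepted.

```scala
import java.nio.charset.Charset
import java.util.Locale

object BomOptionCheck {
  // Assumption: only Unicode encodings (canonical names starting with "UTF-") take a BOM.
  // Charset.forName throws for names the JVM does not know.
  def supportsBom(charsetName: String): Boolean =
    Charset.forName(charsetName).name().toUpperCase(Locale.ROOT).startsWith("UTF-")

  // Fail fast when the option is enabled together with a non-UTF encoding.
  def validate(writeBom: Boolean, charsetName: String): Unit =
    require(!writeBom || supportsBom(charsetName),
      s"writeBOM is only supported for UTF encodings, but got: $charsetName")
}
```

Resolving the name through `Charset.forName` first means aliases such as `utf8` are normalised to their canonical name before the prefix test.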






[GitHub] [spark] stczwd commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
stczwd commented on a change in pull request #29088:
URL: https://github.com/apache/spark/pull/29088#discussion_r454746965



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala
##########
@@ -39,6 +39,10 @@ class CsvOutputWriter(
 
   private val gen = new UnivocityGenerator(dataSchema, writer, params)
 
+  if (params.bom) {
+    writer.write(0xFEFF)

Review comment:
       We hit the same problem in our project, and we use `0xEFBBBF` as the default BOM for UTF-8; the value changes if we use `0xFEFF`. We have also met other problems, such as commas being handled incorrectly or quotation marks not being displayed properly. Fixing those problems for Excel would leave other users unable to read the files with other tools.
   
   Hm, what I am trying to say is that it may not be a good idea to change CsvOutputWriter to fit the Excel format. This can be handled in the project before the user downloads the CSV files, or the user can just use Excel's import.
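
The alternative described here, handling the BOM in the project rather than in `CsvOutputWriter`, might be sketched as a small post-processing step run before the file is offered for download. `ExcelExport` and `withUtf8Bom` are hypothetical names, and the sketch assumes the CSV is UTF-8 and small enough to read into memory.

```scala
import java.nio.file.{Files, Path}

object ExcelExport {
  // The three bytes that U+FEFF encodes to in UTF-8.
  private val Utf8Bom: Array[Byte] = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  // Copy `source` to `target`, adding a UTF-8 BOM unless one is already present.
  def withUtf8Bom(source: Path, target: Path): Unit = {
    val bytes = Files.readAllBytes(source)
    val alreadyMarked = bytes.take(3).sameElements(Utf8Bom)
    val out = if (alreadyMarked) bytes else Utf8Bom ++ bytes
    Files.write(target, out)
    ()
  }
}
```

Keeping this outside the writer leaves the files Spark produces unchanged for other tools, which is the concern raised in the comment.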






[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-657974102


   **[Test build #125794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125794/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).
    * This patch **fails PySpark pip packaging tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29088:
URL: https://github.com/apache/spark/pull/29088#issuecomment-658324651


   **[Test build #125833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125833/testReport)** for PR 29088 at commit [`9dde8c2`](https://github.com/apache/spark/commit/9dde8c277dfb7d4925cd4981f7c3183c51f4af8e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.

