Posted to issues@spark.apache.org by "Maciej Szymkiewicz (Jira)" <ji...@apache.org> on 2020/01/26 00:47:00 UTC
[jira] [Created] (SPARK-30645) collect() support Unicode characters
test fails on Windows
Maciej Szymkiewicz created SPARK-30645:
------------------------------------------
Summary: collect() support Unicode characters test fails on Windows
Key: SPARK-30645
URL: https://issues.apache.org/jira/browse/SPARK-30645
Project: Spark
Issue Type: Bug
Components: SparkR, Tests
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz
As is, the [test_that("collect() support Unicode characters")|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869] test case seems to be system dependent, and doesn't work properly on Windows with the CP1252 English locale:
{code:r}
library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version
#         "Windows"      "Server x64"     "build 17763"
#          nodename           machine             login
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator"
#              user    effective_user
#   "Administrator"   "Administrator"
Sys.getlocale()
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")
# Write the test data to a temporary file, then inspect what was actually written.
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)
system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0
df <- read.df(jsonPath, "json")
printSchema(df)
# root
# |-- _corrupt_record: string (nullable = true)
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
{code}
The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is somehow silenced when testthat 1.x is used.
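The output above is consistent with writeLines() encoding the file in the native code page rather than in UTF-8. Characters that CP1252 cannot represent are written as <U+XXXX> escapes, while "à" is written as the single CP1252 byte 0xE0, which is not valid UTF-8 in that position and so presumably ends up in _corrupt_record. Below is a minimal sketch of that mechanism in plain R (no Spark session needed), followed by one possible way to force UTF-8 output. It is an illustration under these assumptions, not necessarily how the test should be fixed, and the variable names are made up for the example:

```r
# The Vietnamese greeting from the test data, as a UTF-8 marked string.
s <- "Xin ch\u00e0o"

# Bytes of the string in CP1252: "à" becomes the single byte 0xE0.
cp1252_bytes <- iconv(s, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
print(cp1252_bytes)
# [1] 58 69 6e 20 63 68 e0 6f

# Bytes of the same string in UTF-8: "à" is the two-byte sequence 0xC3 0xA0.
utf8_bytes <- charToRaw(enc2utf8(s))
print(utf8_bytes)
# [1] 58 69 6e 20 63 68 c3 a0 6f

# A reader that assumes UTF-8 sees the CP1252 bytes as an invalid sequence.
as_utf8 <- rawToChar(cp1252_bytes)
Encoding(as_utf8) <- "UTF-8"
print(validUTF8(as_utf8))
# [1] FALSE

# Possible workaround (an assumption, not a confirmed fix): convert to UTF-8
# explicitly and write the bytes as they are, bypassing the native encoding.
lines <- "{\"name\":\"Xin ch\u00e0o\"}"
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(enc2utf8(lines), jsonPath, useBytes = TRUE)

# Read the raw bytes back and check that the file is valid UTF-8.
written <- readBin(jsonPath, what = "raw", n = file.size(jsonPath))
print(validUTF8(rawToChar(written)))
# [1] TRUE
```

With the file written this way, read.df(jsonPath, "json") would be expected to receive valid UTF-8 regardless of the locale, though this has not been verified on the Windows setup described above.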