Posted to issues@spark.apache.org by "Maciej Szymkiewicz (Jira)" <ji...@apache.org> on 2020/01/26 00:47:00 UTC

[jira] [Created] (SPARK-30645) collect() support Unicode characters test fails on Windows

Maciej Szymkiewicz created SPARK-30645:
------------------------------------------

             Summary: collect() support Unicode characters test fails on Windows
                 Key: SPARK-30645
                 URL: https://issues.apache.org/jira/browse/SPARK-30645
             Project: Spark
          Issue Type: Bug
          Components: SparkR, Tests
    Affects Versions: 3.0.0
            Reporter: Maciej Szymkiewicz


As it stands, the [test_that("collect() support Unicode characters")|https://github.com/apache/spark/blob/d5b92b24c41b047c64a4d89cc4061ebf534f0995/R/pkg/tests/fulltests/test_sparkSQL.R#L850-L869] test case appears to be system-dependent and doesn't work properly on Windows with the CP1252 English locale:

 
{code:r}
library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version 
#         "Windows"      "Server x64"     "build 17763" 
#          nodename           machine             login 
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
#              user    effective_user 
#   "Administrator"   "Administrator" 

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

# Write the test data to a temporary file before inspecting it
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0

df <- read.df(jsonPath, "json")


printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>

{code}

The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is somehow silenced when testthat 1.x is used.
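
The output above looks consistent with writeLines() re-encoding the strings to the session's native encoding (CP1252 here) before writing. That is an assumption, not something this report confirms. The following is a minimal sketch of one way to check it: write the same kind of data as explicit UTF-8 bytes and read it back. The \u escapes and the roundTrip variable are illustrative only and are not part of the existing test.

{code:r}
# Sketch only: write the lines as UTF-8 bytes regardless of the session locale.
# The strings use \u escapes so the script itself does not depend on the
# encoding of the source file.
lines <- c("{\"name\":\"\uc548\ub155\ud558\uc138\uc694\"}",
           "{\"name\":\"Xin ch\u00e0o\"}")

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")

# enc2utf8() converts each element to UTF-8; useBytes = TRUE asks writeLines()
# to emit those bytes as-is instead of re-encoding to the native encoding.
writeLines(enc2utf8(lines), jsonPath, useBytes = TRUE)

# Read the file back, declaring its encoding explicitly, and compare.
roundTrip <- readLines(jsonPath, encoding = "UTF-8")
identical(enc2utf8(roundTrip), enc2utf8(lines))
{code}

If the comparison holds on the affected Windows machine while the plain writeLines(lines, jsonPath) call still produces <U+XXXX> escapes, that would point at the file-writing step rather than at read.df or collect().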


