Posted to issues@spark.apache.org by "Yanbo Liang (JIRA)" <ji...@apache.org> on 2017/09/01 14:47:00 UTC
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150657#comment-16150657 ]
Yanbo Liang commented on SPARK-21727:
-------------------------------------
I can run this successfully with a minor change:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(as.list(rep(0, 20)))
mySparkDf <- as.DataFrame(myDf)
collect(mySparkDf)
{code}
This is because rep(0, 20) returns a numeric vector, not a list, so we need to convert it to a list explicitly.
{code}
> class(rep(0, 20))
[1] "numeric"
> class(as.list(rep(0, 20)))
[1] "list"
{code}
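If each row holds a different vector, the same conversion could probably be applied to the whole column at once. The following is only a sketch under that assumption: the per-row values are made up for illustration, and it uses base R only, so it does not by itself confirm that as.DataFrame and collect succeed on the result.
{code}
# Base R only; no Spark session is needed for this part.
indices <- 1:4
myDf <- data.frame(indices)
# Hypothetical data: one numeric vector per row.
myDf$data <- list(c(1, 2), c(3, 4), c(5, 6), c(7, 8))
# Convert every element of the column from a numeric vector to an R list.
myDf$data <- lapply(myDf$data, as.list)
# Check the class of each element in the column.
sapply(myDf$data, class)
{code}
After this step every element of the data column is an R list rather than a numeric vector, which is the shape used in the working example above.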
> Operating on an ArrayType in a SparkR DataFrame throws error
> ------------------------------------------------------------
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Neil McQuarrie
>
> Previously [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] this as a Stack Overflow question, but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer *list* -- i.e., each of the elements in the column embeds an entire R list of integers -- then it seems I can convert this data.frame to a SparkR DataFrame just fine... SparkR treats the column as ArrayType(Double).
> However, any subsequent operation on this SparkR DataFrame appears to throw an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf)
> 'data.frame': 4 obs. of 2 variables:
> $ indices: int 1 2 3 4
> $ data :List of 4
> ..$ : num 0 0 0 0 0 0 0 0 0 0 ...
> ..$ : num 0 0 0 0 0 0 0 0 0 0 ...
> ..$ : num 0 0 0 0 0 0 0 0 0 0 ...
> ..$ : num 0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)
> indices data
> 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
> java.lang.Double is not a valid external type for schema of array<double>
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.