Posted to user@spark.apache.org by xingye <xi...@hotmail.com> on 2016/09/20 02:22:17 UTC

as.Date can't be applied to Spark data frame in SparkR

Hi, all
I've noticed that as.Date can't be applied to a Spark DataFrame. I've created the following UDF and used dapply to convert an integer column "aa" to a date, with 1960-01-01 as the origin.
change_date <- function(df) {
  df <- as.POSIXlt(as.Date(df$aa, origin = "1960-01-01", tz = "UTC"))
}

customSchema <- structType(structField("rc", "integer"),
                           ....
                           structField("change_date(x)", "timestamp"))

rollup_1_t <- dapply(rollup_1,
                     function(x) { x <- cbind(x, change_date(x)) },
                     schema = customSchema)
It works on a small dataset, but it takes forever on a big one, and head(rollup_1_t) does not return a result. I guess this is because the change_date function forces each partition of the Spark DataFrame to be converted back to an R data frame, which is slow and could fail outright. Is there a better solution?
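One alternative worth sketching: keep the conversion inside Spark by using SparkR's built-in column functions instead of dapply, so no partition has to be serialized out to an R worker. A minimal sketch, assuming a Spark 2.0-era SparkR API (withColumn, expr) and that "aa" is the day count since 1960-01-01:

    # Compute the date natively in Spark SQL; the cast to timestamp
    # matches the "timestamp" field in customSchema above.
    rollup_1_t <- withColumn(
      rollup_1, "change_date",
      expr("cast(date_add(cast('1960-01-01' as date), aa) as timestamp)"))

Since everything stays in the JVM, this should scale with the size of the data rather than with the cost of shipping every row through R.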
Thanks,
Ye

RE: as.Date can't be applied to Spark data frame in SparkR

Posted by xingye <tr...@gmail.com>.
Update:
The job can finish, but it takes a long time on 10M rows of data. Is there a better solution?
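If dapply is the bottleneck, the same conversion can also be written as a single SQL pass over a temp view, which keeps the whole 10M-row job inside Spark. A sketch under the same assumptions (Spark 2.0 SparkR; "aa" is the day offset from 1960-01-01):

    # Register the DataFrame as a view and derive the date column in SQL.
    createOrReplaceTempView(rollup_1, "rollup_1")
    rollup_1_t <- sql("SELECT *,
                              cast(date_add(cast('1960-01-01' as date), aa)
                                   as timestamp) AS change_date
                       FROM rollup_1")

head(rollup_1_t) should then return promptly, since the plan is evaluated lazily and only the previewed rows are computed.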
From: xing_mayye@hotmail.com
To: user@spark.apache.org
Subject: as.Date can't be applied to Spark data frame in SparkR
Date: Tue, 20 Sep 2016 10:22:17 +0800



