Posted to user@spark.apache.org by xingye <tr...@gmail.com> on 2016/09/09 17:35:32 UTC

questions about using dapply

I have a question about using UDF in SparkR. I’m converting some R code into SparkR.
• The original R code is:

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))

• If I use dapply and put the original apply function as a function for dapply,

cols_in <- dapply(df,
                  function(x) { apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y) { y %in% c(61, 99) }) },
                  schema)

the error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default

• If I use spark.lapply, it still shows an error. It seems that in Spark the column cr_cd1 is ambiguous.

cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x) { x %in% c(61, 99) })

16/09/08 ERROR RBackendHandler: select on 3101 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could be: cr_cd1#2169L, cr_cd1#17787L.;

• If I use dapplyCollect, it works, but it will lead to memory issues if the data is big. How can dapply work in my case?

wrapper <- function(df) {
  out <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))
  return(out)
}
cols_in <- dapplyCollect(df, wrapper)
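For what it's worth, an untested sketch of how the dapply version might be made to work — assuming MARGIN is capitalized and an explicit output schema is supplied (dapply, unlike dapplyCollect, requires a schema describing the data.frame the function returns; the names `cols` and `out_schema` here are illustrative):

cols <- paste("cr_cd", 1:12, sep = "")
# Hypothetical output schema: one boolean column per input column
out_schema <- do.call(structType,
                      lapply(cols, function(n) structField(n, "boolean")))
cols_in <- dapply(df,
                  function(x) {
                    # runs per partition on a local data.frame;
                    # the result must match out_schema
                    as.data.frame(apply(x[, cols], MARGIN = 2,
                                        FUN = "%in%", c(61, 99)))
                  },
                  out_schema)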

RE: questions about using dapply

Posted by xingye <tr...@gmail.com>.
Hi Felix,
Thanks for the information. As in my previous email, I've capitalized MARGIN and it worked with dapplyCollect, but it does not work with dapply.

If I use dapply and put the original apply function as a function for dapply,

cols_in <- dapply(df,
                  function(x) { apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y) { y %in% c(61, 99) }) },
                  schema)

the error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default
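(A likely cause, for what it's worth: the quoted dapply call still spells the argument `Margin=2`. R argument matching is case-sensitive, so `Margin` does not match apply's `MARGIN` parameter and falls into `...`; the anonymous function then binds positionally to MARGIN, leaving FUN empty — exactly the reported error. A minimal base-R illustration:)

m <- matrix(1:4, nrow = 2)

# Correct: MARGIN capitalized, FUN supplied -> column sums
apply(m, MARGIN = 2, FUN = sum)

# Buggy: 'Margin' is not a real argument name, so it lands in '...';
# 'sum' binds positionally to MARGIN and FUN is left missing:
# apply(m, Margin = 2, sum)
# Error in match.fun(FUN) : argument "FUN" is missing, with no default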

From: felixcheung_m@hotmail.com
To: user@spark.apache.org; tracy.upenn@gmail.com
Subject: Re: questions about using dapply
Date: Sun, 11 Sep 2016 01:52:37 +0000

You might need MARGIN capitalized, this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
function(x) {apply(x[, paste("c", 1:2, sep = "")], MARGIN=2, FUN = function(y){ y %in% c(61, 99)})})
# dapplyCollect does not require the schema parameter

_____________________________

From: xingye <tr...@gmail.com>

Sent: Friday, September 9, 2016 10:35 AM

Subject: questions about using dapply

To: <us...@spark.apache.org>

I have a question about using UDF in SparkR. I’m converting some R code into SparkR.

• The original R code is:

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))

• If I use dapply and put the original apply function as a function for dapply,

cols_in <- dapply(df,
                  function(x) { apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y) { y %in% c(61, 99) }) },
                  schema)

the error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default

• If I use spark.lapply, it still shows an error. It seems that in Spark the column cr_cd1 is ambiguous.

cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x) { x %in% c(61, 99) })

16/09/08 ERROR RBackendHandler: select on 3101 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could be: cr_cd1#2169L, cr_cd1#17787L.;

• If I use dapplyCollect, it works, but it will lead to memory issues if the data is big. How can dapply work in my case?

wrapper <- function(df) {
  out <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))
  return(out)
}

cols_in <- dapplyCollect(df, wrapper)

Re: questions about using dapply

Posted by Felix Cheung <fe...@hotmail.com>.
You might need MARGIN capitalized, this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
function(x) {apply(x[, paste("c", 1:2, sep = "")], MARGIN=2, FUN = function(y){ y %in% c(61, 99)})})
# dapplyCollect does not require the schema parameter
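(dapply, by contrast, does require one. An untested sketch of the equivalent dapply call, assuming the same two renamed columns; the schema variable `s` is illustrative:)

# Hypothetical schema for the two boolean output columns
s <- structType(structField("c1", "boolean"), structField("c2", "boolean"))
cols_in <- dapply(c,
                  function(x) {
                    # must return a data.frame matching the schema
                    as.data.frame(apply(x[, paste("c", 1:2, sep = "")],
                                        MARGIN = 2,
                                        FUN = function(y) { y %in% c(61, 99) }))
                  },
                  s)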


_____________________________
From: xingye <tr...@gmail.com>
Sent: Friday, September 9, 2016 10:35 AM
Subject: questions about using dapply
To: <us...@spark.apache.org>

I have a question about using UDF in SparkR. I'm converting some R code into SparkR.


* The original R code is:

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))


* If I use dapply and put the original apply function as a function for dapply,

cols_in <- dapply(df,
                  function(x) { apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y) { y %in% c(61, 99) }) },
                  schema)

The error shows: Error in match.fun(FUN) : argument "FUN" is missing, with no default


* If I use spark.lapply, it still shows an error. It seems that in Spark the column cr_cd1 is ambiguous.

cols_in <- spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x) { x %in% c(61, 99) })

16/09/08 ERROR RBackendHandler: select on 3101 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could be: cr_cd1#2169L, cr_cd1#17787L.;



* If I use dapplyCollect, it works, but it will lead to memory issues if the data is big. How can dapply work in my case?

wrapper <- function(df) {
  out <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", c(61, 99))
  return(out)
}

cols_in <- dapplyCollect(df, wrapper)